LMDeploy Accelerates Mixed-Precision LLM Inference with TurboMind

Fangcheng Fu; Guoliang He; Han Lv; Kai Chen; Li Zhang; Ningsheng Ma; Qian Yao; Xin Chen; Youhe Jiang

arxiv: 2508.15601 · v2 · pith:2IAX2M2Ynew · submitted 2025-08-21 · 💻 cs.DC · cs.PF


Li Zhang, Youhe Jiang, Guoliang He, Xin Chen, Han Lv, Qian Yao, Ningsheng Ma, Fangcheng Fu, Kai Chen


Pith reviewed 2026-05-21 22:12 UTC · model grok-4.3


keywords mixed-precision inference · LLM serving · GEMM pipeline · attention pipeline · hardware-aware optimization · latency reduction · throughput improvement · GPU inference


The pith

A mixed-precision inference engine for large language models achieves up to 61 percent lower latency and 156 percent higher throughput by using hardware-aware pipelines that generalize without custom kernels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that mixed-precision techniques for large language models can be made efficient and broadly applicable by replacing fragmented hand-tuned kernels with two unified hardware-aware pipelines. A general matrix multiply pipeline handles weight operations through offline packing and online acceleration, while an attention pipeline supports varying precision levels for queries, keys, and values. Four techniques enable this: hardware-aware weight packing and adaptive head alignment for broad compatibility, plus instruction-level parallelism and a KV memory loading pipeline for better resource use. If correct, this would mean practitioners can deploy mixed-precision models on different GPUs and precision combinations with reliable speedups rather than rebuilding kernels each time. The evaluations across sixteen models and four GPU types support consistent gains in both latency and throughput.
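To make the split between offline packing and online acceleration concrete, here is a minimal sketch of the general idea: 4-bit weights are packed eight to a 32-bit word ahead of time, then unpacked and rescaled to floating point when the matrix multiply runs. The function names, the linear nibble order, and the `scale` and `zero_point` parameters are illustrative assumptions. This is not TurboMind's hardware-aware layout, which, per the paper's figures, is permuted to match the lane layout the MMA instruction expects.

```python
def pack_int4(values):
    """Offline step: pack unsigned 4-bit integers, eight per 32-bit word."""
    if len(values) % 8 != 0:
        raise ValueError("length must be a multiple of 8")
    words = []
    for i in range(0, len(values), 8):
        word = 0
        for lane, v in enumerate(values[i:i + 8]):
            if not 0 <= v <= 15:
                raise ValueError("value out of 4-bit range")
            word |= v << (4 * lane)  # lane 0 occupies the lowest nibble
        words.append(word)
    return words


def dequantize_int4(words, scale, zero_point):
    """Online step: unpack each nibble and map it back to a float."""
    out = []
    for word in words:
        for lane in range(8):
            q = (word >> (4 * lane)) & 0xF
            out.append((q - zero_point) * scale)
    return out


# Hypothetical quantized weights for one group (values chosen for illustration).
weights_q = [0, 1, 2, 3, 12, 13, 14, 15]
packed = pack_int4(weights_q)
restored = dequantize_int4(packed, scale=0.5, zero_point=8)
```

The packed form stores eight weights in the space of one 32-bit word, which is where the memory saving comes from; the unpacking cost at run time is the overhead an engine has to hide.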

Core claim

The central claim is that TurboMind, the inference engine, delivers generalizable mixed-precision LLM serving through a GEMM pipeline that optimizes matrix operations via offline weight packing and online acceleration, together with an attention pipeline for efficient computation across different Query, Key, and Value precision combinations. These are realized by hardware-aware weight packing, adaptive head alignment, instruction-level parallelism, and a KV memory loading pipeline. Comprehensive tests on sixteen popular LLMs and four representative GPU architectures show up to 61 percent lower serving latency (30 percent on average) and up to 156 percent higher throughput (58 percent on average) compared to existing mixed-precision frameworks.
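The latency and throughput figures are quoted in different forms (a reduction versus an increase), so it can help to put both on a common "times faster" scale. The sketch below is plain arithmetic on the percentages quoted above; the function names are made up for illustration, and the two metrics come from separate measurements, so the resulting factors are not interchangeable.

```python
def speedup_from_latency_reduction(reduction):
    """A fractional latency cut r means the work finishes 1 / (1 - r) times faster."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def speedup_from_throughput_gain(gain):
    """A fractional throughput gain g means (1 + g) times as many tokens per second."""
    if gain < 0:
        raise ValueError("gain must be non-negative")
    return 1.0 + gain


peak_latency_factor = speedup_from_latency_reduction(0.61)    # roughly 2.56x
peak_throughput_factor = speedup_from_throughput_gain(1.56)   # 2.56x
avg_latency_factor = speedup_from_latency_reduction(0.30)     # roughly 1.43x
avg_throughput_factor = speedup_from_throughput_gain(0.58)    # 1.58x
```

Read this way, the two peak numbers describe improvements of similar size, a factor of about 2.56 in each case.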

What carries the argument

Two hardware-aware mixed-precision pipelines (a GEMM pipeline for matrix operations and an attention pipeline for Query-Key-Value computations) enabled by four techniques: hardware-aware weight packing and adaptive head alignment for generalizability plus instruction-level parallelism and KV memory loading pipeline for efficiency.
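The efficiency argument behind a memory loading pipeline is that fetching the next tile can overlap with computing on the current one. The following is an idealized timing model of that overlap, not the engine's kernel: the function names, the per-tile times, and the assumption that a free buffer is always available for the next load are all illustrative.

```python
def serial_time(load, compute):
    """No overlap: every tile is fully loaded, then fully computed."""
    return sum(load) + sum(compute)


def pipelined_time(load, compute):
    """Two-stage pipeline: loading tile i+1 overlaps computing tile i.

    Assumes loads are issued back to back and a free buffer always
    exists for the next tile (an idealization).
    """
    load_done = 0.0
    compute_done = 0.0
    for t_load, t_compute in zip(load, compute):
        load_done += t_load  # loads run one after another
        # compute on a tile starts once it is loaded and the previous compute is finished
        compute_done = max(compute_done, load_done) + t_compute
    return compute_done


# Two tiles with made-up costs in arbitrary time units.
load = [3.0, 3.0]
compute = [2.0, 2.0]
t_serial = serial_time(load, compute)
t_pipe = pipelined_time(load, compute)
```

In this model the pipelined schedule is bounded by the slower of the two stages, so the benefit shrinks as one stage dominates the other.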

Load-bearing premise

The four key techniques automatically generalize across diverse hardware architectures and precision formats without requiring fragmented hand-tuned kernels for each combination.

What would settle it

Running the same mixed-precision workloads on a previously untested GPU architecture or precision format where performance gains disappear or manual kernel tuning becomes necessary would show the generalizability claim does not hold.

Figures

Figures reproduced from arXiv: 2508.15601 by Fangcheng Fu, Guoliang He, Han Lv, Kai Chen, Li Zhang, Ningsheng Ma, Qian Yao, Xin Chen, Youhe Jiang.

**Figure 1.** Illustration of the memory hierarchy and each step of the mixed-precision inference workflow.

**Figure 2.** Illustration of register memory misalignment with low-bit KV cache.

**Figure 4.** Attention pipeline. The transpose V operation converts V to a column-major tile layout for tensor core compatibility, and the final output O is rearranged back into row-major linear memory before the global write.

**Figure 7.** Illustration of fragment storage in step (iv). Single- and two-fragment storage refer to how many packed fragments are written in one store operation. We typically use two-fragment storage for LDS efficiency.

**Figure 6.** Illustration of repacking and permuting operations in step (iii), and the runtime I2F conversion. The values {0-7} in the figure represent the indices of eight elements within a single thread fragment. This procedure guarantees that, after I2F conversion, the data already match the lane layout required by the MMA instruction.

**Figure 9.** Overall process of parallel MMA-dequantization.

**Figure 10.** Illustration of the KV memory loading pipeline when the context spans two KV tiles (𝐾0, 𝑉0 and 𝐾1, 𝑉1). The kernel executes the load–compute pipeline in 16-value micro-tiles (a macro-tile consists of 64 tokens) [10].

**Figure 11.** Benchmarking results of prefill and decoding latency for attention and GEMM kernels within a single request on the Qwen3 8B AWQ model with 8-bit KV cache.

**Figure 12.** Benchmarking results of accumulated attention and GEMM kernel execution latencies on the Qwen3 8B AWQ model with 8-bit KV cache.

**Figure 13.** Benchmarking results of our INT4×FP16 kernel versus a general FP16×FP16 GEMM kernel on an A100 GPU.

**Figure 14.** End-to-end experiments comparing LMDeploy with vLLM+MARLIN. Rows show: (1-2) throughput and TTFT latency across batch sizes, (3) latency for online serving at maximum batch size and request rate, and (4) latency under varying request rates on A100 GPU.

**Figure 16.** Latency and throughput comparison between LMDeploy and vLLM+MARLIN on QwQ with math and validation workloads on an A100 GPU.

**Figure 15.** Serving latencies of LMDeploy compared with vLLM+MARLIN on different models on A100 GPUs.

**Figure 19.** Latency and throughput comparison between LMDeploy and vLLM+MARLIN with the FP8 Qwen3 8B model on an H100 GPU.

**Figure 17.** End-to-end experiments of LMDeploy compared with TensorRT-LLM on L40S and A100 GPUs.

**Figure 18.** Latency and throughput comparison between LMDeploy and vLLM+MARLIN with 8-bit KV cache.

**Figure 21.** Throughput comparison between different KV precisions of LMDeploy with different serving batch sizes on an A100 GPU.

**Figure 22.** Illustration of wasted memory bandwidth due to uncoalesced memory access (an example of two transactions).

**Figure 23.** Illustration of reduced memory throughput due to shared memory bank conflicts (an example of 32-way bank conflict).

**Figure 24.** Illustration of padding and shuffling in MMA data misalignment.

**Figure 25.** Illustration of 8×128 byte swizzle unit.

**Figure 26.** Memory bandwidth utilization of LMDeploy's attention kernel at different batch sizes.

**Figure 27.** Latency comparison between LMDeploy and vLLM using general inference configuration W16A16KV16 (without mixed-precision formats) on H100 GPUs.

**Figure 28.** Scalability of LMDeploy in multi-GPU serving (tensor parallelism degree = {1, 2, 4, 8}).

Original abstract

Mixed-precision inference techniques reduce the memory and computational demands of Large Language Models (LLMs) by applying hybrid precision formats to model weights, activations, and KV caches. However, existing systems struggle to (i) automatically generalize across diverse hardware architectures and precision formats, often requiring fragmented, hand-tuned kernels, and (ii) fully exploit available memory and compute resources, often causing performance bottlenecks. To address these problems, we propose TurboMind, a generalizable and efficient mixed-precision LLM inference engine of LMDeploy. TurboMind is built around two hardware-aware mixed-precision pipelines: A General Matrix Multiply (GEMM) pipeline that optimizes matrix operations through offline weight packing and online acceleration, and an attention pipeline that enables efficient attention computation with different Query, Key, and Value precision combinations. These pipelines are enabled by four key techniques: (i) Hardware-aware weight packing and (ii) adaptive head alignment for generalizability, and (iii) instruction-level parallelism and (iv) a KV memory loading pipeline for efficiency. We conduct comprehensive evaluations of LMDeploy powered by TurboMind across sixteen popular LLMs and four representative GPU architectures. Results demonstrate that LMDeploy achieves up to 61% lower serving latency (30% on average) and up to 156% higher throughput (58% on average) in mixed-precision workloads compared to existing mixed-precision frameworks, establishing consistent performance improvements across all tested configurations and hardware types. This work is open-sourced and publicly available at https://github.com/InternLM/lmdeploy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

TurboMind gives practical speedups on mixed-precision inference for the tested setups but the no-hand-tuning generalizability claim is not yet strongly backed beyond four GPUs.


Hi, the main point is that LMDeploy's TurboMind adds four concrete techniques to handle mixed-precision GEMM and attention without obvious fragmentation on the hardware the authors tried. Hardware-aware weight packing and adaptive head alignment aim at generalizability, while instruction-level parallelism and the KV memory loading pipeline target efficiency. The authors report average gains of 30% lower latency and 58% higher throughput across 16 models and 4 GPUs, with higher peaks, and they have open-sourced the code. That combination of integration and public release is the useful part for anyone actually running these workloads. The benchmarks appear to be straightforward empirical comparisons rather than fitted claims, which keeps the circularity risk low.

The soft spot is the generalization story. The stress-test note is fair: results are shown only on four representative GPUs, with no ablations isolating the adaptation logic and no tests on further architectures or precision mixes. The paper treats the four techniques as sufficient to avoid per-combination kernels, but that rests on the limited evaluation rather than on direct evidence of broad applicability. If the full text has more implementation details or additional runs, that would help; otherwise the headline numbers are solid for the tested cases but the broader claim stays provisional.

This is for practitioners who deploy LLMs and want better mixed-precision serving numbers, plus anyone tracking open inference engines. It is not a theoretical advance but a solid engineering package. I would send it to peer review because the empirical results are concrete, the code is available, and the problem it targets is real, even if the generalization part would likely draw revision requests.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces TurboMind, a mixed-precision LLM inference engine integrated into LMDeploy. It proposes two hardware-aware pipelines (GEMM and attention) enabled by four techniques—hardware-aware weight packing, adaptive head alignment, instruction-level parallelism, and KV memory loading pipeline—to achieve generalizability across hardware and precision formats while improving efficiency. Evaluations on 16 LLMs and 4 GPUs report up to 61% lower latency (30% average) and 156% higher throughput (58% average) versus existing mixed-precision frameworks.

Significance. If the empirical gains prove robust with fair baselines and the techniques demonstrate genuine generalizability, the work could meaningfully advance practical mixed-precision LLM serving by reducing reliance on fragmented per-hardware kernels. The open-sourcing of the code is a clear strength that aids reproducibility and community validation.

major comments (1)

[Evaluation] Evaluation section: Results are reported only on four GPU architectures and sixteen models with no ablation isolating the adaptive head alignment or hardware-aware packing logic. This leaves the central claim that the four techniques 'automatically generalize across diverse hardware architectures and precision formats without requiring fragmented hand-tuned kernels' insufficiently supported, as the manuscript provides no additional architectures, cross-precision stress tests, or code-level evidence of genericity.
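One way to picture the ablation the comment asks for is as a grid of on/off switches over the four techniques. The sketch below only enumerates such configurations; the flag names are hypothetical, and whether the real engine lets each technique be disabled independently is an assumption, not something the manuscript states.

```python
from itertools import product

# Hypothetical flag names, one per technique named in the manuscript.
TECHNIQUES = (
    "weight_packing",
    "head_alignment",
    "instruction_level_parallelism",
    "kv_loading_pipeline",
)


def ablation_configs(techniques=TECHNIQUES):
    """Yield every on/off combination as a dict of feature flags."""
    for flags in product((True, False), repeat=len(techniques)):
        yield dict(zip(techniques, flags))


def leave_one_out(techniques=TECHNIQUES):
    """Yield the cheaper design: all techniques on except one."""
    for off in techniques:
        yield {name: name != off for name in techniques}


full_grid = list(ablation_configs())   # 2**4 configurations
loo_grid = list(leave_one_out())       # one configuration per technique
```

A leave-one-out design is usually enough to attribute gains to individual techniques; the full grid is only needed if interactions between techniques are suspected.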

minor comments (2)

The abstract and introduction refer to 'existing mixed-precision frameworks' as baselines; the main text should explicitly name the compared systems (e.g., vLLM, TensorRT-LLM variants) and confirm identical precision configurations and batch sizes for each.
Figure captions and tables would benefit from explicit mention of whether error bars or multiple runs are included, given the performance variability typical in LLM serving benchmarks.
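The second minor comment comes down to reporting spread alongside the mean. A minimal sketch of that bookkeeping follows; the latency samples are invented for illustration and do not come from the paper, and the helper names are hypothetical.

```python
from statistics import mean, stdev


def summarize(latencies_s):
    """Return (mean, sample standard deviation) for repeated runs of one configuration."""
    if len(latencies_s) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return mean(latencies_s), stdev(latencies_s)


def relative_reduction(baseline_mean, candidate_mean):
    """Fractional latency reduction of the candidate relative to the baseline."""
    return (baseline_mean - candidate_mean) / baseline_mean


# Invented latency samples in seconds, three runs per system.
baseline_runs = [10.0, 10.4, 9.6]
candidate_runs = [7.0, 7.2, 6.8]

b_mean, b_std = summarize(baseline_runs)
c_mean, c_std = summarize(candidate_runs)
gain = relative_reduction(b_mean, c_mean)
```

Reporting `b_std` and `c_std` next to `gain` lets a reader judge whether a claimed improvement is larger than the run-to-run noise.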

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment on the evaluation section below, clarifying the support for our generalizability claims and outlining planned revisions.


Referee: [Evaluation] Evaluation section: Results are reported only on four GPU architectures and sixteen models with no ablation isolating the adaptive head alignment or hardware-aware packing logic. This leaves the central claim that the four techniques 'automatically generalize across diverse hardware architectures and precision formats without requiring fragmented hand-tuned kernels' insufficiently supported, as the manuscript provides no additional architectures, cross-precision stress tests, or code-level evidence of genericity.

Authors: We agree that ablation studies isolating the contributions of adaptive head alignment and hardware-aware weight packing would strengthen the evidence for the generalizability claims. In the revised manuscript we will add these ablations, along with expanded discussion of how the hardware-aware pipelines enable adaptation across precision formats without per-hardware kernels. The reported results already show consistent gains (up to 61% lower latency and 156% higher throughput) across 16 models and 4 representative GPU architectures, which were chosen to cover different compute and memory characteristics. The open-sourced code provides direct inspectable evidence of the implementation approach. We will also incorporate additional cross-precision results to the extent space allows.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmarks with no derived predictions or self-referential equations

full rationale

The paper reports measured latency and throughput improvements from running sixteen LLMs on four GPUs and comparing against existing frameworks. No equations, fitted parameters, or first-principles derivations appear in the abstract or described content; the four techniques are presented as engineering implementations whose benefits are validated by direct experiment rather than by construction from the input data. The central claims therefore remain independent of any self-definition or self-citation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central performance claims rest on standard assumptions about GPU hardware capabilities and the correctness of mixed-precision arithmetic; no free parameters or invented entities are introduced in the abstract.

axioms (1)

Domain assumption: Target GPUs support the described instruction-level parallelism and memory access patterns for the KV loading pipeline.
Invoked to justify the efficiency of the attention and GEMM pipelines across the four tested architectures.



Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MPDocBench-Parse: Benchmarking Practical Multi-page Document Parsing
cs.AI 2026-05 unverdicted novelty 7.0

MPDocBench-Parse provides a 3,246-page benchmark and evaluation protocol for multi-page document parsing that tests text/table/formula extraction, merging, figure handling, reading order, and heading hierarchy.

HexAGenT: Efficient Agentic LLM Serving via Workflow- and Heterogeneity-Aware Scheduling
cs.DC 2026-05 unverdicted novelty 7.0

HexAGenT reduces the SLO scale required for timely agentic LLM workflow completion by an average of 20.1% at 95% attainment and 33.0% at 99% attainment on heterogeneous A100/H100/H200 clusters.

Multi-Scale Dequant: Eliminating Dequantization Bottleneck via Activation Decomposition for Efficient LLM Inference
stat.ML 2026-05 unverdicted novelty 7.0

MSD eliminates dequantization from the GEMM path by decomposing BF16 activations into multiple low-precision parts that multiply directly with INT8 or MXFP4 weights, achieving near-16 effective bits for INT8 and 6.6 f...

Autopoiesis: A Self-Evolving System Paradigm for LLM Serving Under Runtime Dynamics
cs.DC 2026-04 unverdicted novelty 7.0

Autopoiesis uses LLM-driven program synthesis to evolve serving policies online during deployment, delivering up to 53% and average 34% gains over prior LLM serving systems under runtime dynamics.

The Silent Hyperparameter: Quantifying the Impact of Inference Backends on LLM Reproducibility
cs.LG 2026-05 conditional novelty 6.0

Different inference backends alter LLM benchmark scores by up to 16.6 percentage points through optimizations such as prefix caching, CUDA graphs, and custom kernels.

The Silent Hyperparameter: Quantifying the Impact of Inference Backends on LLM Reproducibility
cs.LG 2026-05 unverdicted novelty 6.0

Empirical study shows LLM inference backends can shift benchmark scores by up to 16.6 percentage points and cause output disagreements due to optimizations like prefix caching and custom kernels.

SceneGraphVLM: Dynamic Scene Graph Generation from Video with Vision-Language Models
cs.CV 2026-05 unverdicted novelty 6.0

SceneGraphVLM generates dynamic scene graphs from video using compact VLMs, TOON serialization, and hallucination-aware RL to improve precision and achieve one-second latency.

HexiSeq: Accommodating Long Context Training of LLMs over Heterogeneous Hardware
cs.DC 2026-05 unverdicted novelty 6.0

HexiSeq optimizes sequence and head partitioning across mixed GPUs to improve long-context LLM training throughput by up to 1.72x in simulations.

Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · cited by 7 Pith papers · 6 internal anchors

[1] 2024. NVIDIA Ampere GPU Architecture Tuning Guide. https://docs.nvidia.com/cuda/ampere-tuning-guide/index.html
[2] Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming Throughput-Latency tradeoff in LLM inference with Sarathi-Serve. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 117–134
[3] AI-MO. 2024. AIMO Validation AIME Dataset. https://huggingface.co/datasets/AI-MO/aimo-validation-aime
[4] AI-MO. 2024. NuminaMath-CoT: A Large-Scale Math Dataset with Chain of Thought. https://huggingface.co/datasets/AI-MO/NuminaMath-CoT
[5] Rajeev Alur, Joseph Devietti, Omar S Navarro Leija, and Nimit Singhania. 2017. GPUDrano: Detecting uncoalesced accesses in GPU programs. In Computer Aided Verification: 29th International Conference, CAV 2017, Heidelberg, Germany, July 24-28, 2017, Proceedings, Part I 30. Springer, 507–525
[6] Reza Yazdani Aminabadi, Samyam Rajbhandari, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji Ruwase, Shaden Smith, Minjia Zhang, Jeff Rasley, et al. 2022. Deepspeed-inference: enabling efficient inference of transformer models at unprecedented scale. In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis...
[7] Anthropic. 2024. The Claude 3 Model Family: Opus, Sonnet, Haiku. https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf
[8] Girish Biswas and Nandini Mukherjee. 2020. Memory optimized dynamic matrix chain multiplication using shared memory in GPU. In International Conference on Distributed Computing and Internet Technology. Springer, 160–172
[9] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. 2018. TVM: An automated End-to-End optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 578–594
[10] Colfax Research. 2024. CUTLASS Tutorial: Design of a GEMM Kernel. https://research.colfax-intl.com/cutlass-tutorial-design-of-a-gemm-kernel/
[11] Tri Dao. [n. d.]. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. In The Twelfth International Conference on Learning Representations
[12–13] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in neural information processing systems 35 (2022), 16344–16359
[14–15] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. Qlora: Efficient finetuning of quantized llms. Advances in neural information processing systems 36 (2023), 10088–10115
[16] Dayou Du, Shijie Cao, Jianyi Cheng, Ting Cao, and Mao Yang. 2025. BitDecoding: Unlocking Tensor Cores for Long-Context LLMs Decoding with Low-Bit KV Cache. arXiv preprint arXiv:2503.18773 (2025)
[17] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)
[18] Jiarui Fang, Yang Yu, Chengduo Zhao, and Jie Zhou. 2021. Turbotransformers: an efficient gpu serving system for transformer models. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 389–402
[19] Naznin Fauzia, Louis-Noël Pouchet, and P Sadayappan. 2015. Characterizing and enhancing global memory data coalescing on GPUs. In 2015 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 12–22
[20] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323 (2022)
[21] Elias Frantar, Roberto L Castro, Jiale Chen, Torsten Hoefler, and Dan Alistarh. 2025. Marlin: Mixed-precision auto-regressive parallel inference on large language models. In Proceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming. 239–251
[22] Shuang Gao. 2014. Improving gpu shared memory access efficiency. (2014)
[23] Mark Gebhart, Stephen W Keckler, Brucek Khailany, Ronny Krashinsky, and William J Dally. 2012. Unifying primary cache, scratch, and register file memories in a throughput processor. In 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE, 96–106
[24] GitHub. 2024. The world’s most widely adopted ai developer tool. https://github.com/features/copilot
[25] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)
[26] Yefei He, Luoming Zhang, Weijia Wu, Jing Liu, Hong Zhou, and Bohan Zhuang. [n. d.]. ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification. In The Thirty-eighth Annual Conference on Neural Information Processing Systems
[27] Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun S Shao, Kurt Keutzer, and Amir Gholami. 2024. Kvquant: Towards 10 million context length llm inference with kv cache quantization. Advances in Neural Information Processing Systems 37 (2024), 1270–1303
[28] Adrian Horga, Ahmed Rezine, Sudipta Chattopadhyay, Petru Eles, and Zebo Peng. 2022. Symbolic identification of shared memory based bank conflicts for GPUs. Journal of Systems Architecture 127 (2022), 102518
[29] Jaeho Jeon and Seongyong Lee. 2023. Large language models in education: A focus on the complementary relationship between human teachers and ChatGPT. Education and Information Technologies 28, 12 (2023), 15873–15892
[30] Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Guoliang He, Xupeng Miao, Ana Klimovic, Bin Cui, Binhang Yuan, and Eiko Yoneki. 2025. Demystifying Cost-Efficiency in LLM Serving over Heterogeneous GPUs. In Forty-second International Conference on Machine Learning
[31] Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Taiyi Wang, Bin Cui, Ana Klimovic, and Eiko Yoneki. [n. d.]. ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments. In Eighth Conference on Machine Learning and Systems
[32] Youhe Jiang, Ran Yan, Xiaozhe Yao, Yang Zhou, Beidi Chen, and Binhang Yuan. 2024. HexGen: Generative Inference of Large Language Model over Heterogeneous Environment. In International Conference on Machine Learning. PMLR, 21946–21961
[33] Youhe Jiang, Ran Yan, and Binhang Yuan. 2025. HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment. In The Thirteenth International Conference on Learning Representations
[34] Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, and Tuo Zhao. 2024. Gear: An efficient kv cache compression recipe for near-lossless generative inference of llm. arXiv preprint arXiv:2403.05527 (2024)
[35] Dae-Hwan Kim. 2017. Evaluation of the performance of GPU global memory coalescing. Evaluation 4, 4 (2017), 1–5
[36] Taesu Kim, Jongho Lee, Daehyun Ahn, Sarang Kim, Jiwoong Choi, Minkyu Kim, and Hyungjun Kim. 2024. QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference. arXiv preprint arXiv:2402.10076 (2024)
[37–38] Young Jin Kim, Rawn Henry, Raffy Fahim, and Hany Hassan Awadalla. Who Says Elephants Can’t Run: Bringing Large Scale MoE Models into Cloud Scale Production. In Proceedings of The Third Workshop on Simple and Efficient Natural Language Processing (SustaiNLP). 36–43

work page
[39]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica

work page
[40]

In Proceedings of the 29th Symposium on Operating Systems Principles

Efficient memory management for large language model serv- ing with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles. 611–626

work page
[41]

Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E Gon- zalez, et al. 2023. {AlpaServe}: Statistical multiplexing with model parallelism for deep learning serving. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23) . 663–679

work page 2023
[42]

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei- Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of Machine Learning and Systems 6 (2024), 87–100

work page 2024
[43]

Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, and Song Han. 2024. Qserve: W4a8kv4 quanti- zation and system co-design for efficient llm serving. arXiv preprint arXiv:2405.04532 (2024)

work page arXiv 2024
[44]

Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. 2024. KIVI: A Tuning- Free Asymmetric 2bit Quantization for KV Cache. In International Conference on Machine Learning . PMLR, 32332–32344

work page 2024
[45]

Justin Luitjens. 2025. CUDA Pro Tip: Increase Performance with Vectorized Memory Access. https://developer.nvidia.com/blog/cuda- pro-tip-increase-performance-with-vectorized-memory-access/

work page 2025
[46]

Weile Luo, Ruibo Fan, Zeyu Li, Dayou Du, Qiang Wang, and Xiaowen Chu. 2024. Benchmarking and dissecting the nvidia hopper gpu archi- tecture. In 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 656–667

work page 2024
[47]

Xupeng Miao, Chunan Shi, Jiangfei Duan, Xiaoli Xi, Dahua Lin, Bin Cui, and Zhihao Jia. 2024. Spotserve: Serving generative large language models on preemptible instances. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 . 1112–1127

work page 2024
[48]

Mistral AI. 2024. Mixtral 8x22B: Cheaper, Better, Faster, Stronger. https://mistral.ai/news/mixtral-8x22b

work page 2024
[49]

NVIDIA Corporation. 2014. cuDNN: NVIDIA CUDA Deep Neural Network Library. https://developer.nvidia.com/cudnn

work page 2014
[50]

NVIDIA Corporation. 2019. FasterTransformer: Transformer related optimization, including BERT, GPT. https://github.com/NVIDIA/ FasterTransformer

work page 2019
[51]

NVIDIA Corporation. 2020. CUTLASS: CUDA Templates for Linear Algebra Subroutines. https://github.com/NVIDIA/cutlass 13

work page 2020
[52]

NVIDIA Corporation. 2020. NVIDIA A100 Tensor Core GPU Archi- tecture. https://www.nvidia.com/en-us/data-center/a100/

work page 2020
[53]

NVIDIA Corporation. 2022. NVIDIA GeForce RTX 4090 Graph- ics Card. https://www.nvidia.com/en-us/geforce/graphics-cards/40- series/rtx-4090/

work page 2022
[54]

NVIDIA Corporation. 2022. NVIDIA H100 Tensor Core GPU Archi- tecture. https://www.nvidia.com/en-us/data-center/h100/

work page 2022
[55]

NVIDIA Corporation. 2023. NVIDIA L40S Data Center GPU. https: //www.nvidia.com/en-us/data-center/l40s/

work page 2023
[56]

NVIDIA Corporation. 2024. Efficient GEMM in CUDA. https://github. com/NVIDIA/cutlass/blob/main/media/docs/efficient_gemm.md

work page 2024
[57]

NVIDIA Corporation. 2024. NVIDIA TensorRT 10.0.1 Developer Guide. https://docs.nvidia.com/deeplearning/tensorrt/archives/ tensorrt-1001/developer-guide/index.html

work page 2024
[58]

NVIDIA Corporation. 2025. CUDA C++ Programming Guide, Release 12.9. https://docs.nvidia.com/cuda/cuda-c-programming-guide/

work page 2025
[59]

NVIDIA Corporation. 2025. Parallel Thread Execution (PTX) ISA: ldmatrix Instruction. https://docs.nvidia.com/cuda/parallel-thread- execution/

work page 2025
[60]

NVIDIA Corporation. 2025. TensorRT-LLM. https://github.com/ NVIDIA/TensorRT-LLM

work page 2025
[61]

NVIDIA Corporation. 2025. Working with Quantized Types. https://docs.nvidia.com/deeplearning/tensorrt/latest/inference- library/work-quantized-types.html

work page 2025
[62]

OpenAI. 2025. OpenAI o3. https://platform.openai.com/docs/models/ o3

work page 2025
[63]

Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. 2024. Splitwise: Efficient generative llm inference using phase splitting. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA) . IEEE, 118–132

work page 2024
[64]

Cheng Peng, Xi Yang, Aokun Chen, Kaleb E Smith, Nima PourNejatian, Anthony B Costa, Cheryl Martin, Mona G Flores, Ying Zhang, Tanja Magoc, et al. 2023. A study of generative large language model for medical research and healthcare. NPJ digital medicine 6, 1 (2023), 210

work page 2023
[65]

PyTorch Core Team. 2025. PyTorch. https://pytorch.org

work page 2025
[66]

Qwen Team. 2025. QwQ-32B: Embracing the Power of Reinforcement Learning. https://qwenlm.github.io/blog/qwq-32b/

work page 2025
[67]

Mariam Rakka, Mohammed E Fouda, Pramod Khargonekar, and Fadi Kurdahi. 2022. Mixed-precision neural networks: A survey. arXiv preprint arXiv:2208.06064 (2022)

work page arXiv 2022
[68]

Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazari- dou, Orhan Firat, Julian Schrittwieser, et al. 2024. Gemini 1.5: Unlock- ing multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[69]

ShareGPT Team. 2023. ShareGPT: Share your wildest ChatGPT con- versations with one click. https://sharegpt.com/

work page 2023
[70]

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 2019
[71]

Yifan Tan, Haoze Wang, Chao Yan, and Yangdong Deng. 2024. AlignedKV: Reducing Memory Access of KV-Cache with Precision- Aligned Quantization. arXiv preprint arXiv:2409.16546 (2024)

work page arXiv 2024
[72]

vLLM Team. 2024. Quantized KV Cache. https://docs.vllm.ai/en/ stable/features/quantization/quantized_kvcache.html

work page 2024
[73]

vLLM Team. 2024. vLLM Quantization: Supported Hard- ware. https://docs.vllm.ai/en/latest/features/quantization/supported_ hardware.html

work page 2024
[74]

Wright, Less and Hoque, Adnan. 2024. Accelerating Triton Dequantiza- tion Kernels for GPTQ. https://pytorch.org/blog/accelerating-triton/

work page 2024
[75]

Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. 2023. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning. PMLR, 38087–38099

work page 2023
[76]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[77]

Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A distributed serving system for {Transformer-Based} generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22) . 521–538

work page 2022
[78]

Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, and Baris Kasikci. 2024. Atom: Low-bit quantization for efficient and accurate llm serving. Proceedings of Machine Learning and Systems 6 (2024), 196–209

work page 2024
[79]

Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, et al. 2020. Ansor: Generating {High-Performance} tensor programs for deep learning. In 14th USENIX symposium on operating systems design and implementation (OSDI 20) . 863–879

work page 2020
[80]

Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Livia Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. 2024. Sglang: Efficient execution of structured language model programs. Advances in Neural Information Processing Systems 37 (2024), 62557–62583

work page 2024
