
arxiv: 2508.15601 · v2 · pith:2IAX2M2Ynew · submitted 2025-08-21 · 💻 cs.DC · cs.PF

LMDeploy Accelerates Mixed-Precision LLM Inference with TurboMind

Pith reviewed 2026-05-21 22:12 UTC · model grok-4.3

classification 💻 cs.DC cs.PF
keywords mixed-precision inference · LLM serving · GEMM pipeline · attention pipeline · hardware-aware optimization · latency reduction · throughput improvement · GPU inference

The pith

A mixed-precision inference engine for large language models achieves up to 61 percent lower latency and 156 percent higher throughput by using hardware-aware pipelines that generalize without custom kernels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that mixed-precision techniques for large language models can be made efficient and broadly applicable by replacing fragmented hand-tuned kernels with two unified hardware-aware pipelines. A general matrix multiply pipeline handles weight operations through offline packing and online acceleration, while an attention pipeline supports varying precision levels for queries, keys, and values. Four techniques enable this: hardware-aware weight packing and adaptive head alignment for broad compatibility, plus instruction-level parallelism and a KV memory loading pipeline for better resource use. If correct, this would mean practitioners can deploy mixed-precision models on different GPUs and precision combinations with reliable speedups rather than rebuilding kernels each time. The evaluations across sixteen models and four GPU types support consistent gains in both latency and throughput.

Core claim

The central claim is that TurboMind, the inference engine, delivers generalizable mixed-precision LLM serving through a GEMM pipeline that optimizes matrix operations via offline weight packing and online acceleration, together with an attention pipeline for efficient computation across different Query, Key, and Value precision combinations. These are realized by hardware-aware weight packing, adaptive head alignment, instruction-level parallelism, and a KV memory loading pipeline. Comprehensive tests on sixteen popular LLMs and four representative GPU architectures show up to 61 percent lower serving latency (30 percent on average) and up to 156 percent higher throughput (58 percent on average) than existing mixed-precision frameworks.
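The headline percentages are easier to compare once converted to speedup factors. The sketch below is plain arithmetic on the figures quoted in the abstract; the function names are hypothetical, and the peak latency and peak throughput numbers may come from different configurations, so the matching factors should not be read as a single measurement.

```python
def speedup_from_latency_reduction(reduction_pct: float) -> float:
    """Convert 'X percent lower latency' into a speedup factor (old / new)."""
    if not 0 <= reduction_pct < 100:
        raise ValueError("latency reduction must be in [0, 100)")
    return 1.0 / (1.0 - reduction_pct / 100.0)


def speedup_from_throughput_gain(gain_pct: float) -> float:
    """Convert 'X percent higher throughput' into a speedup factor (new / old)."""
    return 1.0 + gain_pct / 100.0


# Headline figures quoted in the abstract.
peak_latency = speedup_from_latency_reduction(61)    # roughly 2.56x
mean_latency = speedup_from_latency_reduction(30)    # roughly 1.43x
peak_throughput = speedup_from_throughput_gain(156)  # 2.56x
mean_throughput = speedup_from_throughput_gain(58)   # 1.58x

for name, value in [
    ("peak latency speedup", peak_latency),
    ("mean latency speedup", mean_latency),
    ("peak throughput speedup", peak_throughput),
    ("mean throughput speedup", mean_throughput),
]:
    print(f"{name}: {value:.2f}x")
```

A 61 percent latency reduction and a 156 percent throughput gain both correspond to about a 2.56x factor, which is why the two peak numbers look so different on paper while describing a similar magnitude of improvement.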

What carries the argument

Two hardware-aware mixed-precision pipelines (a GEMM pipeline for matrix operations and an attention pipeline for Query-Key-Value computations) enabled by four techniques: hardware-aware weight packing and adaptive head alignment for generalizability plus instruction-level parallelism and KV memory loading pipeline for efficiency.
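The split between offline weight packing and online acceleration can be illustrated with a toy model. The sketch below is not TurboMind's layout: it assumes unsigned 4-bit weights packed in simple sequential nibble order (the paper describes a hardware-aware permuted layout), a single per-tensor scale and zero point, and pure-Python arithmetic in place of tensor-core MMA. All names are hypothetical.

```python
from typing import List

NIBBLES_PER_WORD = 8  # eight 4-bit values fit in one 32-bit word


def pack_int4(values: List[int]) -> List[int]:
    """Offline step: pack unsigned 4-bit values into 32-bit words."""
    if len(values) % NIBBLES_PER_WORD:
        raise ValueError("length must be a multiple of 8")
    words = []
    for start in range(0, len(values), NIBBLES_PER_WORD):
        word = 0
        for i, v in enumerate(values[start:start + NIBBLES_PER_WORD]):
            if not 0 <= v <= 15:
                raise ValueError("value does not fit in 4 bits")
            word |= v << (4 * i)  # element i occupies bits [4i, 4i + 3]
        words.append(word)
    return words


def unpack_dequant(words: List[int], scale: float, zero_point: int) -> List[float]:
    """Online step: extract each nibble and convert it to floating point."""
    out = []
    for word in words:
        for i in range(NIBBLES_PER_WORD):
            q = (word >> (4 * i)) & 0xF
            out.append((q - zero_point) * scale)
    return out


def dot(weights: List[float], activations: List[float]) -> float:
    """Stand-in for one GEMM inner product on dequantized weights."""
    return sum(w * a for w, a in zip(weights, activations))


quantized = [0, 1, 2, 3, 12, 13, 14, 15]
packed = pack_int4(quantized)                      # one 32-bit word
restored = unpack_dequant(packed, scale=0.5, zero_point=8)
result = dot(restored, [1.0] * 8)
print(hex(packed[0]), restored, result)
```

Packing happens once when the model is converted, so only the unpack-and-convert step sits on the inference critical path; that is the cost the paper's instruction-level parallelism is meant to hide.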

Load-bearing premise

The four key techniques automatically generalize across diverse hardware architectures and precision formats without requiring fragmented hand-tuned kernels for each combination.

What would settle it

Running the same mixed-precision workloads on a previously untested GPU architecture or precision format where performance gains disappear or manual kernel tuning becomes necessary would show the generalizability claim does not hold.

Figures

Figures reproduced from arXiv: 2508.15601 by Fangcheng Fu, Guoliang He, Han Lv, Kai Chen, Li Zhang, Ningsheng Ma, Qian Yao, Xin Chen, Youhe Jiang.

Figure 1
Figure 1: Illustration of the memory hierarchy and each step of the mixed-precision inference workflow. … inference is challenging because it typically demands intensive memory and compute management. In this section, we first introduce a typical mixed-precision inference workflow, then discuss the key challenges in existing pipelines, and finally present our mixed-precision pipeline. 3.1 Typical Mixed-Precision Inf…
Figure 2
Figure 2: Illustration of register memory misalignment with low-bit KV cache. Challenge-I: Global memory coalescing. Modern GPUs achieve peak memory bandwidth when the memory addresses accessed by every thread within a warp are within the same aligned segment of global memory (e.g., 32-byte on Hopper/Ampere). This alignment enables the warp to access contiguous memory regions through one efficient global memory tran…
Figure 4
Figure 4: Attention pipeline. The transpose V operation converts V to a column-major tile layout for tensor core compatibility, and the final output O is rearranged back into row-major linear memory before the global write. … through the standard memory hierarchy, and dequantizes it to FP16 using I2F scaling (Challenge-IV). Furthermore, the KV memory loading pipeline (detailed in §4.4) overlaps the KV memory loading w…
Figure 7
Figure 7: Illustration of fragment storage in step (iv). Single- and two-fragment storage refer to how many packed fragments are written in one store operation. We typically use two-fragment storage for LDS efficiency. … directly and efficiently with the same two-instruction sequence from step (ii): an asynchronous copy followed by the matrix-load instruction (e.g., cp.async + LDS on Ampere), without any additional a…
Figure 6
Figure 6: Illustration of repacking and permuting operations in step (iii), and the runtime I2F conversion. The values {0-7} in the figure represent the indices of eight elements within a single thread fragment. This procedure guarantees that, after I2F conversion, the data already match the lane layout required by the MMA instruction. … slice into registers. In this step, the instruction’s internal crossbar automatic…
Figure 9
Figure 9: Overall process of parallel MMA-dequantization. Parallel MMA-dequantization. To minimize the dequantization overhead, we implement a software-pipelined mainloop that orchestrates three concurrent stages across different execution units: (i) Tensor cores execute mma.sync operations on the current tile 𝑘, performing the matrix multiplication using previously dequantized fragments. (ii) INT/FP ALUs run t…
Figure 10
Figure 10: Illustration of the KV memory loading pipeline when the context spans two KV tiles (𝐾0, 𝑉0 and 𝐾1, 𝑉1). The kernel executes the load–compute pipeline in 16-value micro-tiles (a macro-tile consists of 64 tokens) [10]. … concurrently with the 𝑄𝐾ᵀ and 𝑃𝑉 computation. For low-bit KV inference, each computation step includes an additional I2F conversion that dequantizes the low-bit KV cache to FP16 format. Such …
Figure 11
Figure 11: Benchmarking results of prefill and decoding latency for attention and GEMM kernels within a single request on the Qwen3 8B AWQ model with 8-bit KV cache. [Plot panels: Attention, A100; GEMM, A100; … x-axis: Batch Size; y-axis: Latency.]
Figure 12
Figure 12: Benchmarking results of accumulated attention and GEMM kernel execution latencies on the Qwen3 8B AWQ model with 8-bit KV cache. [Plot panels: GEMM, Qwen3 8B; GEMM, Qwen3 14B; x-axis: Batch Size; y-axis: Latency (s); legend: vLLM+MARLIN (INT4×FP16), LMDeploy (INT4×FP16), LMDeploy (FP1…]
Figure 13
Figure 13: Benchmarking results of our INT4×FP16 kernel versus a general FP16×FP16 GEMM kernel on an A100 GPU. … with 8-bit KV cache compression (fp8_e5m2 [68]) as the baseline method. The results demonstrate that our optimized attention kernel achieves average latency reductions of 22.1% (maximum: 48.7%) during prefill operations and 7.6% (maximum: 29.9%) during decode operations compared with the baseline method,…
Figure 14
Figure 14: End-to-end experiments comparing LMDeploy with vLLM+MARLIN. Rows show: (1-2) throughput and TTFT latency across batch sizes, (3) latency for online serving at maximum batch size and request rate, and (4) latency under varying request rates on A100 GPU. [Plot panels: Qwen2.5 72B (AWQ); Llama3 8B (A…; x-axis: AVG and P90 to P99 percentiles; y-axis: Latency (s).]
Figure 16
Figure 16: Latency and throughput comparison between LMDeploy and vLLM+MARLIN on QwQ with math and validation workloads on an A100 GPU. … workloads, we conducted specialized evaluations using QwQ AWQ models designed for mathematical reasoning and validation tasks. For throughput performance, LMDeploy achieves an average speedup of 15% compared to vLLM+MARLIN, with peak improvements of 27% observed in validation tasks…
Figure 15
Figure 15: Serving latencies of LMDeploy compared with vLLM+MARLIN on different models on A100 GPUs. … this comprehensive model suite, LMDeploy achieves an average serving latency improvement of 21.1%, with maximum improvements reaching 47.9%. At the critical P99 latency percentile, LMDeploy delivers an average improvement of 20.0% with peak improvements of 39.2%, ensuring reliable performance even under tail latency…
Figure 19
Figure 19: Latency and throughput comparison between LMDeploy and vLLM+MARLIN with the FP8 Qwen3 8B model on an H100 GPU. [Plot panels: A100, Llama2 7B; A100, Llama3 8B; A100, Llama2 13B; A100, Llama2 70B; A100, Qwen…; y-axis: Thpt (t/s).]
Figure 17
Figure 17: End-to-end experiments of LMDeploy compared with TensorRT-LLM on L40S and A100 GPUs. [Plot panels: A100, Qwen3 8B; A100, Qwen3 32B; H100, Qwen3 8B; H100, Qwen3 32B; x-axis: AVG and P90 to P99 percentiles; y-axis: Latency (s).]
Figure 18
Figure 18: Latency and throughput comparison between LMDeploy and vLLM+MARLIN with 8-bit KV cache. … 118.90%, with peak speedups reaching 171.11% across different batch configurations. And LMDeploy reduces TTFT by an average of 52.2%, with maximum improvements of 65.0%. For end-to-end latency across all percentile measurements, LMDeploy delivers an average reduction of 50.3%, with peak improvements of 59.2%. These su…
Figure 21
Figure 21: Throughput comparison between different KV precision of LMDeploy with different serving batch sizes on an A100 GPU. … different GPU and model configurations and report the optimal variant for each case. LMDeploy consistently outperforms all baselines, achieving an average throughput improvement of 14.1% (maximum: 23.0%) over OmniServe-QServe despite the latter’s use of more aggressive 8-bit activation qu…
Figure 22
Figure 22: Illustration of wasted memory bandwidth due to uncoalesced memory access (an example of two transactions). Modern GPUs achieve peak memory bandwidth when the memory addresses accessed by every thread within a warp are within the same aligned segment of global memory (e.g., 32-byte on Hopper/Ampere). This alignment enables the warp to access contiguous memory regions through one efficient global memory tran…
Figure 23
Figure 23: Illustration of reduced memory throughput due to shared memory bank conflicts (an example of 32-way bank conflict). [Diagram labels: Useful Data, Wasted Padding, Hardware MMA Tile (4x8), Data Matrix (4x6), Isolate & Shift, Combine, Data in Register, Hardware Required Layout, Padding, Shuffling.]
Figure 24
Figure 24: Illustration of padding and shuffling in MMA data misalignment. … extra shuffles. After swizzling, the same logical tile is permuted so that the ldmatrix loads are conflict-free and each lane receives exactly the elements the MMA instruction expects, while the horizontal cp.async writes remain coalesced.
Figure 25
Figure 25: Illustration of 8×128 byte swizzle unit.
Figure 26
Figure 26: Memory bandwidth utilization of LMDeploy's attention kernel at different batch sizes. As shown in …
Figure 27
Figure 27: Latency comparison between LMDeploy and vLLM using general inference configuration W16A16KV16 (without mixed-precision formats) on H100 GPUs. As shown in …
Figure 28
Figure 28: Scalability of LMDeploy in multi-GPU serving (tensor parallelism degree = {1, 2, 4, 8}). Scalability of LMDeploy. As shown in …
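The bank-conflict and swizzle captions above (Figures 23 to 25) can be made concrete with a small model. The sketch below is an illustration under stated assumptions, not TurboMind's layout: it assumes 32 four-byte shared-memory banks, a 32×32 tile of 4-byte words, and a plain XOR swizzle rather than the 8×128-byte swizzle unit of Figure 25. All function names are hypothetical.

```python
from collections import Counter

NUM_BANKS = 32  # assumed: 32 shared-memory banks, each 4 bytes wide


def bank_of(word_index: int) -> int:
    """Bank that serves a given 4-byte word address."""
    return word_index % NUM_BANKS


def conflict_degree(word_indices) -> int:
    """Largest number of distinct words in one warp access that share a bank."""
    per_bank = Counter(bank_of(w) for w in set(word_indices))
    return max(per_bank.values())


def linear_index(row: int, col: int, row_stride: int = 32) -> int:
    """Plain row-major placement of a tile element."""
    return row * row_stride + col


def swizzled_index(row: int, col: int, row_stride: int = 32) -> int:
    """XOR the column with the row so one logical column spreads over all banks."""
    return row * row_stride + (col ^ (row % NUM_BANKS))


# Column access: thread t of a warp reads element (row=t, col=0).
column_linear = [linear_index(t, 0) for t in range(32)]
column_swizzled = [swizzled_index(t, 0) for t in range(32)]
# Row access: thread t reads element (row=0, col=t).
row_swizzled = [swizzled_index(0, t) for t in range(32)]

print("column, linear layout:  ", conflict_degree(column_linear), "way")
print("column, swizzled layout:", conflict_degree(column_swizzled), "way")
print("row, swizzled layout:   ", conflict_degree(row_swizzled), "way")
```

In this model the row-major layout serializes a column read into a 32-way conflict, while the XOR permutation keeps both row and column reads conflict-free because it is a bijection within each row.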
read the original abstract

Mixed-precision inference techniques reduce the memory and computational demands of Large Language Models (LLMs) by applying hybrid precision formats to model weights, activations, and KV caches. However, existing systems struggle to (i) automatically generalize across diverse hardware architectures and precision formats, often requiring fragmented, hand-tuned kernels, and (ii) fully exploit available memory and compute resources, often causing performance bottlenecks. To address these problems, we propose TurboMind, a generalizable and efficient mixed-precision LLM inference engine of LMDeploy. TurboMind is built around two hardware-aware mixed-precision pipelines: a General Matrix Multiply (GEMM) pipeline that optimizes matrix operations through offline weight packing and online acceleration, and an attention pipeline that enables efficient attention computation with different Query, Key, and Value precision combinations. These pipelines are enabled by four key techniques: (i) hardware-aware weight packing and (ii) adaptive head alignment for generalizability, and (iii) instruction-level parallelism and (iv) a KV memory loading pipeline for efficiency. We conduct comprehensive evaluations of LMDeploy powered by TurboMind across sixteen popular LLMs and four representative GPU architectures. Results demonstrate that LMDeploy achieves up to 61% lower serving latency (30% on average) and up to 156% higher throughput (58% on average) in mixed-precision workloads compared to existing mixed-precision frameworks, establishing consistent performance improvements across all tested configurations and hardware types. This work is open-sourced and publicly available at https://github.com/InternLM/lmdeploy.
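The benefit of a KV memory loading pipeline comes from overlapping the load of one tile with the computation on the previous one. The sketch below is an idealized timing model, not the TurboMind kernel: it assumes two buffers, a fixed load time and compute time per tile, and no other stalls. The function names and the example timings are hypothetical.

```python
def serial_time(num_tiles: int, load: float, compute: float) -> float:
    """Each tile is loaded, then computed on, with no overlap."""
    return num_tiles * (load + compute)


def pipelined_time(num_tiles: int, load: float, compute: float) -> float:
    """Double-buffered schedule: tile i loads while tile i-1 is computed on."""
    load_done = []     # finish time of each tile's load
    compute_done = []  # finish time of each tile's compute
    for i in range(num_tiles):
        prev_load = load_done[i - 1] if i >= 1 else 0.0
        # With two buffers, tile i reuses the buffer freed by tile i-2.
        buffer_free = compute_done[i - 2] if i >= 2 else 0.0
        load_done.append(max(prev_load, buffer_free) + load)
        prev_compute = compute_done[i - 1] if i >= 1 else 0.0
        compute_done.append(max(load_done[i], prev_compute) + compute)
    return compute_done[-1] if compute_done else 0.0


# Two KV tiles, as in the two-tile illustration of the pipeline (arbitrary time units).
print(serial_time(2, load=3.0, compute=5.0))     # no overlap
print(pipelined_time(2, load=3.0, compute=5.0))  # load of tile 1 hidden behind compute of tile 0
```

Under these assumptions the pipelined time equals `load + (num_tiles - 1) * max(load, compute) + compute`, so the gain grows with the number of tiles and is bounded by whichever stage is slower.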

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces TurboMind, a mixed-precision LLM inference engine integrated into LMDeploy. It proposes two hardware-aware pipelines (GEMM and attention) enabled by four techniques—hardware-aware weight packing, adaptive head alignment, instruction-level parallelism, and KV memory loading pipeline—to achieve generalizability across hardware and precision formats while improving efficiency. Evaluations on 16 LLMs and 4 GPUs report up to 61% lower latency (30% average) and 156% higher throughput (58% average) versus existing mixed-precision frameworks.

Significance. If the empirical gains prove robust with fair baselines and the techniques demonstrate genuine generalizability, the work could meaningfully advance practical mixed-precision LLM serving by reducing reliance on fragmented per-hardware kernels. The open-sourcing of the code is a clear strength that aids reproducibility and community validation.

major comments (1)
  1. [Evaluation] Evaluation section: Results are reported only on four GPU architectures and sixteen models with no ablation isolating the adaptive head alignment or hardware-aware packing logic. This leaves the central claim that the four techniques 'automatically generalize across diverse hardware architectures and precision formats without requiring fragmented hand-tuned kernels' insufficiently supported, as the manuscript provides no additional architectures, cross-precision stress tests, or code-level evidence of genericity.
minor comments (2)
  1. The abstract and introduction refer to 'existing mixed-precision frameworks' as baselines; the main text should explicitly name the compared systems (e.g., vLLM, TensorRT-LLM variants) and confirm identical precision configurations and batch sizes for each.
  2. Figure captions and tables would benefit from explicit mention of whether error bars or multiple runs are included, given the performance variability typical in LLM serving benchmarks.
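The second minor comment asks for run-to-run variability to be reported. A minimal sketch of what such a summary could look like follows, assuming per-request latencies collected over repeated runs; the sample values are made up for illustration, the helper names are hypothetical, and the nearest-rank percentile is one of several common definitions.

```python
import math
import statistics


def percentile(samples, pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct percent of data at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_s):
    """Mean, sample standard deviation, and tail percentiles of latency samples."""
    return {
        "mean": statistics.mean(latencies_s),
        "stdev": statistics.stdev(latencies_s),
        "p90": percentile(latencies_s, 90),
        "p99": percentile(latencies_s, 99),
    }


# Hypothetical latencies in seconds; not taken from the paper.
runs = [1.0, 1.2, 1.1, 0.9, 1.3, 1.0, 1.1, 1.2, 1.0, 2.2]
summary = summarize(runs)
print(summary)
```

Reporting the standard deviation alongside the mean, and the tail percentiles alongside both, makes it visible when a single slow request (2.2 s here) dominates the P99 figure.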

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment on the evaluation section below, clarifying the support for our generalizability claims and outlining planned revisions.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section: Results are reported only on four GPU architectures and sixteen models with no ablation isolating the adaptive head alignment or hardware-aware packing logic. This leaves the central claim that the four techniques 'automatically generalize across diverse hardware architectures and precision formats without requiring fragmented hand-tuned kernels' insufficiently supported, as the manuscript provides no additional architectures, cross-precision stress tests, or code-level evidence of genericity.

    Authors: We agree that ablation studies isolating the contributions of adaptive head alignment and hardware-aware weight packing would strengthen the evidence for the generalizability claims. In the revised manuscript we will add these ablations, along with expanded discussion of how the hardware-aware pipelines enable adaptation across precision formats without per-hardware kernels. The reported results already show consistent gains (up to 61% lower latency and 156% higher throughput) across 16 models and 4 representative GPU architectures, which were chosen to cover different compute and memory characteristics. The open-sourced code provides direct inspectable evidence of the implementation approach. We will also incorporate additional cross-precision results to the extent space allows. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmarks with no derived predictions or self-referential equations

full rationale

The paper reports measured latency and throughput improvements from running sixteen LLMs on four GPUs and comparing against existing frameworks. No equations, fitted parameters, or first-principles derivations appear in the abstract or described content; the four techniques are presented as engineering implementations whose benefits are validated by direct experiment rather than by construction from the input data. The central claims therefore remain independent of any self-definition or self-citation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central performance claims rest on standard assumptions about GPU hardware capabilities and the correctness of mixed-precision arithmetic; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Target GPUs support the described instruction-level parallelism and memory access patterns for the KV loading pipeline.
    Invoked to justify the efficiency of the attention and GEMM pipelines across the four tested architectures.

pith-pipeline@v0.9.0 · 5825 in / 1312 out tokens · 31755 ms · 2026-05-21T22:12:06.773078+00:00 · methodology


Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MPDocBench-Parse: Benchmarking Practical Multi-page Document Parsing

    cs.AI 2026-05 unverdicted novelty 7.0

    MPDocBench-Parse provides a 3,246-page benchmark and evaluation protocol for multi-page document parsing that tests text/table/formula extraction, merging, figure handling, reading order, and heading hierarchy.

  2. HexAGenT: Efficient Agentic LLM Serving via Workflow- and Heterogeneity-Aware Scheduling

    cs.DC 2026-05 unverdicted novelty 7.0

    HexAGenT reduces the SLO scale required for timely agentic LLM workflow completion by an average of 20.1% at 95% attainment and 33.0% at 99% attainment on heterogeneous A100/H100/H200 clusters.

  3. Multi-Scale Dequant: Eliminating Dequantization Bottleneck via Activation Decomposition for Efficient LLM Inference

    stat.ML 2026-05 unverdicted novelty 7.0

    MSD eliminates dequantization from the GEMM path by decomposing BF16 activations into multiple low-precision parts that multiply directly with INT8 or MXFP4 weights, achieving near-16 effective bits for INT8 and 6.6 f...

  4. Autopoiesis: A Self-Evolving System Paradigm for LLM Serving Under Runtime Dynamics

    cs.DC 2026-04 unverdicted novelty 7.0

    Autopoiesis uses LLM-driven program synthesis to evolve serving policies online during deployment, delivering up to 53% and average 34% gains over prior LLM serving systems under runtime dynamics.

  5. The Silent Hyperparameter: Quantifying the Impact of Inference Backends on LLM Reproducibility

    cs.LG 2026-05 conditional novelty 6.0

    Different inference backends alter LLM benchmark scores by up to 16.6 percentage points through optimizations such as prefix caching, CUDA graphs, and custom kernels.

  6. The Silent Hyperparameter: Quantifying the Impact of Inference Backends on LLM Reproducibility

    cs.LG 2026-05 unverdicted novelty 6.0

    Empirical study shows LLM inference backends can shift benchmark scores by up to 16.6 percentage points and cause output disagreements due to optimizations like prefix caching and custom kernels.

  7. SceneGraphVLM: Dynamic Scene Graph Generation from Video with Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    SceneGraphVLM generates dynamic scene graphs from video using compact VLMs, TOON serialization, and hallucination-aware RL to improve precision and achieve one-second latency.

  8. HexiSeq: Accommodating Long Context Training of LLMs over Heterogeneous Hardware

    cs.DC 2026-05 unverdicted novelty 6.0

    HexiSeq optimizes sequence and head partitioning across mixed GPUs to improve long-context LLM training throughput by up to 1.72x in simulations.

Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · cited by 7 Pith papers · 6 internal anchors

  1. [1]

    NVIDIA Ampere GPU Architecture Tuning Guide

    2024. NVIDIA Ampere GPU Architecture Tuning Guide. https://docs.nvidia.com/cuda/ampere-tuning-guide/index.html

  2. [2]

    Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming Throughput-Latency tradeoff in LLM inference with Sarathi-Serve. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 117–134

  3. [3]

    AI-MO. 2024. AIMO Validation AIME Dataset. https://huggingface.co/datasets/AI-MO/aimo-validation-aime

  4. [4]

    AI-MO. 2024. NuminaMath-CoT: A Large-Scale Math Dataset with Chain of Thought. https://huggingface.co/datasets/AI-MO/NuminaMath-CoT

  5. [5]

    Rajeev Alur, Joseph Devietti, Omar S Navarro Leija, and Nimit Singhania. 2017. GPUDrano: Detecting uncoalesced accesses in GPU programs. In Computer Aided Verification: 29th International Conference, CAV 2017, Heidelberg, Germany, July 24-28, 2017, Proceedings, Part I 30. Springer, 507–525

  6. [6]

    Reza Yazdani Aminabadi, Samyam Rajbhandari, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji Ruwase, Shaden Smith, Minjia Zhang, Jeff Rasley, et al. 2022. Deepspeed-inference: enabling efficient inference of transformer models at unprecedented scale. In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis...

  7. [7]

    Anthropic. 2024. The Claude 3 Model Family: Opus, Sonnet, Haiku. https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf

  8. [8]

    Girish Biswas and Nandini Mukherjee. 2020. Memory optimized dynamic matrix chain multiplication using shared memory in GPU. In International Conference on Distributed Computing and Internet Technology. Springer, 160–172

  9. [9]

    Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. 2018. TVM: An automated End-to-End optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 578–594

  10. [10]

    Colfax Research. 2024. CUTLASS Tutorial: Design of a GEMM Kernel. https://research.colfax-intl.com/cutlass-tutorial-design-of-a-gemm-kernel/

  11. [11]

    Tri Dao. [n. d.]. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. In The Twelfth International Conference on Learning Representations

  12. [12]

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in neural information processing systems 35 (2022), 16344–16359

  14. [14]

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. Advances in neural information processing systems 36 (2023), 10088–10115

  16. [16]

    Dayou Du, Shijie Cao, Jianyi Cheng, Ting Cao, and Mao Yang. 2025. BitDecoding: Unlocking Tensor Cores for Long-Context LLMs Decoding with Low-Bit KV Cache. arXiv preprint arXiv:2503.18773 (2025)

  17. [17]

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)

  18. [18]

    Jiarui Fang, Yang Yu, Chengduo Zhao, and Jie Zhou. 2021. Turbotransformers: an efficient gpu serving system for transformer models. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 389–402

  19. [19]

    Naznin Fauzia, Louis-Noël Pouchet, and P Sadayappan. 2015. Characterizing and enhancing global memory data coalescing on GPUs. In 2015 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 12–22

  20. [20]

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323 (2022)

  21. [21]

    Elias Frantar, Roberto L Castro, Jiale Chen, Torsten Hoefler, and Dan Alistarh. 2025. Marlin: Mixed-precision auto-regressive parallel inference on large language models. In Proceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming. 239–251

  22. [22]

    Shuang Gao. 2014. Improving gpu shared memory access efficiency. (2014)

  23. [23]

    Mark Gebhart, Stephen W Keckler, Brucek Khailany, Ronny Krashinsky, and William J Dally. 2012. Unifying primary cache, scratch, and register file memories in a throughput processor. In 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE, 96–106

  24. [24]

    GitHub. 2024. The world’s most widely adopted ai developer tool. https://github.com/features/copilot

  25. [25]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)

  26. [26]

    Yefei He, Luoming Zhang, Weijia Wu, Jing Liu, Hong Zhou, and Bohan Zhuang. [n. d.]. ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification. In The Thirty-eighth Annual Conference on Neural Information Processing Systems

  27. [27]

    Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun S Shao, Kurt Keutzer, and Amir Gholami. 2024. Kvquant: Towards 10 million context length llm inference with kv cache quantization. Advances in Neural Information Processing Systems 37 (2024), 1270–1303

  28. [28]

    Adrian Horga, Ahmed Rezine, Sudipta Chattopadhyay, Petru Eles, and Zebo Peng. 2022. Symbolic identification of shared memory based bank conflicts for GPUs. Journal of Systems Architecture 127 (2022), 102518

  29. [29]

    Jaeho Jeon and Seongyong Lee. 2023. Large language models in education: A focus on the complementary relationship between human teachers and ChatGPT. Education and Information Technologies 28, 12 (2023), 15873–15892

  30. [30]

    Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Guoliang He, Xupeng Miao, Ana Klimovic, Bin Cui, Binhang Yuan, and Eiko Yoneki. 2025. Demystifying Cost-Efficiency in LLM Serving over Heterogeneous GPUs. In Forty-second International Conference on Machine Learning

  31. [31]

    Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Taiyi Wang, Bin Cui, Ana Klimovic, and Eiko Yoneki. [n. d.]. ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments. In Eighth Conference on Machine Learning and Systems

  32. [32]

    Youhe Jiang, Ran Yan, Xiaozhe Yao, Yang Zhou, Beidi Chen, and Binhang Yuan. 2024. HexGen: Generative Inference of Large Language Model over Heterogeneous Environment. In International Conference on Machine Learning. PMLR, 21946–21961

  33. [33]

    Youhe Jiang, Ran Yan, and Binhang Yuan. 2025. HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment. In The Thirteenth International Conference on Learning Representations

  34. [34]

    Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, and Tuo Zhao. 2024. Gear: An efficient kv cache compression recipe for near-lossless generative inference of llm. arXiv preprint arXiv:2403.05527 (2024)

  35. [35]

    Dae-Hwan Kim. 2017. Evaluation of the performance of GPU global memory coalescing. Evaluation 4, 4 (2017), 1–5

  36. [36]

    Taesu Kim, Jongho Lee, Daehyun Ahn, Sarang Kim, Jiwoong Choi, Minkyu Kim, and Hyungjun Kim. 2024. QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference. arXiv preprint arXiv:2402.10076 (2024)

  37. [37]

    Young Jin Kim, Rawn Henry, Raffy Fahim, and Hany Hassan Awadalla. Who Says Elephants Can’t Run: Bringing Large Scale MoE Models into Cloud Scale Production. In Proceedings of The Third Workshop on Simple and Efficient Natural Language Processing (SustaiNLP). 36–43

  39. [39]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles. 611–626

  41. [41]

    Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E Gonzalez, et al. 2023. AlpaServe: Statistical multiplexing with model parallelism for deep learning serving. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23). 663–679

  42. [42]

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of Machine Learning and Systems 6 (2024), 87–100

  43. [43]

    Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, and Song Han. 2024. Qserve: W4a8kv4 quantization and system co-design for efficient llm serving. arXiv preprint arXiv:2405.04532 (2024)

  44. [44]

    Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. 2024. KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache. In International Conference on Machine Learning. PMLR, 32332–32344

  45. [45]

    Justin Luitjens. 2025. CUDA Pro Tip: Increase Performance with Vectorized Memory Access. https://developer.nvidia.com/blog/cuda-pro-tip-increase-performance-with-vectorized-memory-access/

  46. [46]

    Weile Luo, Ruibo Fan, Zeyu Li, Dayou Du, Qiang Wang, and Xiaowen Chu. 2024. Benchmarking and dissecting the nvidia hopper gpu architecture. In 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 656–667

  47. [47]

    Xupeng Miao, Chunan Shi, Jiangfei Duan, Xiaoli Xi, Dahua Lin, Bin Cui, and Zhihao Jia. 2024. Spotserve: Serving generative large language models on preemptible instances. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 1112–1127

  48. [48]

    Mistral AI. 2024. Mixtral 8x22B: Cheaper, Better, Faster, Stronger. https://mistral.ai/news/mixtral-8x22b

  49. [49]

    NVIDIA Corporation. 2014. cuDNN: NVIDIA CUDA Deep Neural Network Library. https://developer.nvidia.com/cudnn

  50. [50]

    NVIDIA Corporation. 2019. FasterTransformer: Transformer related optimization, including BERT, GPT. https://github.com/NVIDIA/FasterTransformer

  51. [51]

    NVIDIA Corporation. 2020. CUTLASS: CUDA Templates for Linear Algebra Subroutines. https://github.com/NVIDIA/cutlass

  52. [52]

    NVIDIA Corporation. 2020. NVIDIA A100 Tensor Core GPU Architecture. https://www.nvidia.com/en-us/data-center/a100/

  53. [53]

    NVIDIA Corporation. 2022. NVIDIA GeForce RTX 4090 Graphics Card. https://www.nvidia.com/en-us/geforce/graphics-cards/40-series/rtx-4090/

  54. [54]

    NVIDIA Corporation. 2022. NVIDIA H100 Tensor Core GPU Architecture. https://www.nvidia.com/en-us/data-center/h100/

  55. [55]

    NVIDIA Corporation. 2023. NVIDIA L40S Data Center GPU. https://www.nvidia.com/en-us/data-center/l40s/

  56. [56]

    NVIDIA Corporation. 2024. Efficient GEMM in CUDA. https://github.com/NVIDIA/cutlass/blob/main/media/docs/efficient_gemm.md

  57. [57]

    NVIDIA Corporation. 2024. NVIDIA TensorRT 10.0.1 Developer Guide. https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-1001/developer-guide/index.html

  58. [58]

    NVIDIA Corporation. 2025. CUDA C++ Programming Guide, Release 12.9. https://docs.nvidia.com/cuda/cuda-c-programming-guide/

  59. [59]

    NVIDIA Corporation. 2025. Parallel Thread Execution (PTX) ISA: ldmatrix Instruction. https://docs.nvidia.com/cuda/parallel-thread-execution/

  60. [60]

    NVIDIA Corporation. 2025. TensorRT-LLM. https://github.com/NVIDIA/TensorRT-LLM

  61. [61]

    NVIDIA Corporation. 2025. Working with Quantized Types. https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/work-quantized-types.html

  62. [62]

    OpenAI. 2025. OpenAI o3. https://platform.openai.com/docs/models/o3

  63. [63]

    Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. 2024. Splitwise: Efficient generative llm inference using phase splitting. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). IEEE, 118–132

  64. [64]

    Cheng Peng, Xi Yang, Aokun Chen, Kaleb E Smith, Nima PourNejatian, Anthony B Costa, Cheryl Martin, Mona G Flores, Ying Zhang, Tanja Magoc, et al. 2023. A study of generative large language model for medical research and healthcare. NPJ digital medicine 6, 1 (2023), 210

  65. [65]

    PyTorch Core Team. 2025. PyTorch. https://pytorch.org

  66. [66]

    Qwen Team. 2025. QwQ-32B: Embracing the Power of Reinforcement Learning. https://qwenlm.github.io/blog/qwq-32b/

  67. [67]

    Mariam Rakka, Mohammed E Fouda, Pramod Khargonekar, and Fadi Kurdahi. 2022. Mixed-precision neural networks: A survey. arXiv preprint arXiv:2208.06064 (2022)

  68. [68]

    Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530 (2024)

  69. [69]

    ShareGPT Team. 2023. ShareGPT: Share your wildest ChatGPT conversations with one click. https://sharegpt.com/

  70. [70]

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 (2019)

  71. [71]

    Yifan Tan, Haoze Wang, Chao Yan, and Yangdong Deng. 2024. AlignedKV: Reducing Memory Access of KV-Cache with Precision-Aligned Quantization. arXiv preprint arXiv:2409.16546 (2024)

  72. [72]

    vLLM Team. 2024. Quantized KV Cache. https://docs.vllm.ai/en/stable/features/quantization/quantized_kvcache.html

  73. [73]

    vLLM Team. 2024. vLLM Quantization: Supported Hardware. https://docs.vllm.ai/en/latest/features/quantization/supported_hardware.html

  74. [74]

    Less Wright and Adnan Hoque. 2024. Accelerating Triton Dequantization Kernels for GPTQ. https://pytorch.org/blog/accelerating-triton/

  75. [75]

    Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. 2023. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning. PMLR, 38087–38099

  76. [76]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)

  77. [77]

    Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A distributed serving system for Transformer-Based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 521–538

  78. [78]

    Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, and Baris Kasikci. 2024. Atom: Low-bit quantization for efficient and accurate llm serving. Proceedings of Machine Learning and Systems 6 (2024), 196–209

  79. [79]

    Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, et al. 2020. Ansor: Generating High-Performance tensor programs for deep learning. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 863–879

  80. [80]

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Livia Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. 2024. Sglang: Efficient execution of structured language model programs. Advances in Neural Information Processing Systems 37 (2024), 62557–62583
