LMDeploy Accelerates Mixed-Precision LLM Inference with TurboMind
Pith reviewed 2026-05-21 22:12 UTC · model grok-4.3
The pith
A mixed-precision inference engine for large language models achieves up to 61 percent lower serving latency and up to 156 percent higher throughput by using hardware-aware pipelines that generalize without custom kernels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that TurboMind, the inference engine, delivers generalizable mixed-precision LLM serving through a GEMM pipeline that optimizes matrix operations via offline weight packing and online acceleration, together with an attention pipeline for efficient computation across different Query, Key, and Value precision combinations. These are realized by hardware-aware weight packing, adaptive head alignment, instruction-level parallelism, and a KV memory loading pipeline. Comprehensive tests on sixteen popular LLMs and four representative GPU architectures show up to 61 percent lower serving latency (30 percent on average) and up to 156 percent higher throughput (58 percent on average) compared to existing mixed-precision frameworks.
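To make the split between offline weight packing and online acceleration concrete, here is a minimal Python sketch. It is not TurboMind's kernel or memory layout: it assumes unsigned 4-bit weights packed two per byte, with a single scale and zero point per row, and the names `pack_int4`, `unpack_int4`, and `dequant_dot` are hypothetical.

```python
def pack_int4(weights):
    """Offline step (assumed layout): pack unsigned 4-bit values two per byte."""
    if len(weights) % 2:
        raise ValueError("need an even number of 4-bit values")
    if any(not 0 <= w <= 15 for w in weights):
        raise ValueError("values must fit in 4 bits")
    # Even-indexed value goes in the low nibble, odd-indexed in the high nibble.
    return bytes(lo | (hi << 4) for lo, hi in zip(weights[0::2], weights[1::2]))


def unpack_int4(packed):
    """Online step: recover the 4-bit values from the packed bytes."""
    out = []
    for byte in packed:
        out.append(byte & 0x0F)  # low nibble
        out.append(byte >> 4)    # high nibble
    return out


def dequant_dot(packed_row, scale, zero_point, activations):
    """Dot product of one packed weight row with higher-precision activations."""
    q = unpack_int4(packed_row)
    return sum((v - zero_point) * scale * a for v, a in zip(q, activations))


row = [3, 15, 0, 8]
packed = pack_int4(row)                      # two bytes instead of four values
result = dequant_dot(packed, 0.5, 8, [1.0, 2.0, 3.0, 4.0])
print(packed.hex(), result)
```

The point of the sketch is only the division of labor: packing happens once, ahead of time, while unpacking and dequantization sit on the hot path of every matrix multiply. A GPU engine would perform the online step inside the GEMM kernel rather than in a separate pass.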
What carries the argument
Two hardware-aware mixed-precision pipelines (a GEMM pipeline for matrix operations and an attention pipeline for Query-Key-Value computations) enabled by four techniques: hardware-aware weight packing and adaptive head alignment for generalizability, plus instruction-level parallelism and a KV memory loading pipeline for efficiency.
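As an illustration of what attention with different Query, Key, and Value precisions can mean, the sketch below keeps the query in floating point while storing keys and values as INT8 with one scale per vector. This is an assumed quantization scheme chosen for simplicity, not the paper's pipeline; `quantize_int8`, `dequantize`, and `attention` are hypothetical names.

```python
import math


def quantize_int8(vec):
    """Symmetric per-vector INT8 quantization: returns (int values, scale)."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127 if max_abs else 1.0
    return [round(x / scale) for x in vec], scale


def dequantize(q, scale):
    """Map INT8 values back to floats using the stored scale."""
    return [v * scale for v in q]


def attention(query, k_cache, v_cache):
    """Single-query attention; caches are lists of (int8 values, scale) pairs."""
    d = len(query)
    scores = []
    for kq, ks in k_cache:
        k = dequantize(kq, ks)
        scores.append(sum(a * b for a, b in zip(query, k)) / math.sqrt(d))
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    out = [0.0] * len(v_cache[0][0])
    for p, (vq, vs) in zip(probs, v_cache):
        for i, v in enumerate(dequantize(vq, vs)):
            out[i] += p * v
    return out


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
k_cache = [quantize_int8(k) for k in keys]
v_cache = [quantize_int8(v) for v in values]
out = attention([1.0, 0.0], k_cache, v_cache)
print(out)
```

The output differs slightly from full-precision attention because of INT8 rounding in the cache. Supporting every combination of Q, K, and V formats in this style is where per-combination kernels tend to multiply, which is the fragmentation the paper says it avoids.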
Load-bearing premise
The four key techniques automatically generalize across diverse hardware architectures and precision formats without requiring fragmented hand-tuned kernels for each combination.
What would settle it
Running the same mixed-precision workloads on a previously untested GPU architecture or precision format where performance gains disappear or manual kernel tuning becomes necessary would show the generalizability claim does not hold.
Original abstract
Mixed-precision inference techniques reduce the memory and computational demands of Large Language Models (LLMs) by applying hybrid precision formats to model weights, activations, and KV caches. However, existing systems struggle to (i) automatically generalize across diverse hardware architectures and precision formats, often requiring fragmented, hand-tuned kernels, and (ii) fully exploit available memory and compute resources, often causing performance bottlenecks. To address these problems, we propose TurboMind, a generalizable and efficient mixed-precision LLM inference engine of LMDeploy. TurboMind is built around two hardware-aware mixed-precision pipelines: A General Matrix Multiply (GEMM) pipeline that optimizes matrix operations through offline weight packing and online acceleration, and an attention pipeline that enables efficient attention computation with different Query, Key, and Value precision combinations. These pipelines are enabled by four key techniques: (i) Hardware-aware weight packing and (ii) adaptive head alignment for generalizability, and (iii) instruction-level parallelism and (iv) a KV memory loading pipeline for efficiency. We conduct comprehensive evaluations of LMDeploy powered by TurboMind across sixteen popular LLMs and four representative GPU architectures. Results demonstrate that LMDeploy achieves up to 61% lower serving latency (30% on average) and up to 156% higher throughput (58% on average) in mixed-precision workloads compared to existing mixed-precision frameworks, establishing consistent performance improvements across all tested configurations and hardware types. This work is open-sourced and publicly available at https://github.com/InternLM/lmdeploy.
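The headline percentages are easier to compare once converted to ratios. The short calculation below assumes the usual reading, in which "X% lower latency" means the new latency is (1 - X/100) times the baseline and "Y% higher throughput" means (1 + Y/100) times the baseline; the abstract does not spell out its formula, and the helper names are hypothetical.

```python
def latency_ratio(percent_lower):
    """Fraction of baseline latency remaining after a 'percent lower' reduction."""
    return 1.0 - percent_lower / 100.0


def throughput_ratio(percent_higher):
    """Multiple of baseline throughput after a 'percent higher' gain."""
    return 1.0 + percent_higher / 100.0


for label, pct in [("peak", 61), ("average", 30)]:
    r = latency_ratio(pct)
    print(f"{label} latency: {r:.2f}x baseline, equivalent speedup {1 / r:.2f}x")

for label, pct in [("peak", 156), ("average", 58)]:
    print(f"{label} throughput: {throughput_ratio(pct):.2f}x baseline")
```

Under that reading, the best-case latency is 0.39 times the baseline and the best-case throughput is 2.56 times the baseline, so the two peak figures describe gains of roughly the same size.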
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces TurboMind, a mixed-precision LLM inference engine integrated into LMDeploy. It proposes two hardware-aware pipelines (GEMM and attention) enabled by four techniques—hardware-aware weight packing, adaptive head alignment, instruction-level parallelism, and KV memory loading pipeline—to achieve generalizability across hardware and precision formats while improving efficiency. Evaluations on 16 LLMs and 4 GPUs report up to 61% lower latency (30% average) and 156% higher throughput (58% average) versus existing mixed-precision frameworks.
Significance. If the empirical gains prove robust with fair baselines and the techniques demonstrate genuine generalizability, the work could meaningfully advance practical mixed-precision LLM serving by reducing reliance on fragmented per-hardware kernels. The open-sourcing of the code is a clear strength that aids reproducibility and community validation.
major comments (1)
- [Evaluation] Evaluation section: Results are reported only on four GPU architectures and sixteen models with no ablation isolating the adaptive head alignment or hardware-aware packing logic. This leaves the central claim that the four techniques 'automatically generalize across diverse hardware architectures and precision formats without requiring fragmented hand-tuned kernels' insufficiently supported, as the manuscript provides no additional architectures, cross-precision stress tests, or code-level evidence of genericity.
minor comments (2)
- The abstract and introduction refer to 'existing mixed-precision frameworks' as baselines; the main text should explicitly name the compared systems (e.g., vLLM, TensorRT-LLM variants) and confirm identical precision configurations and batch sizes for each.
- Figure captions and tables would benefit from explicit mention of whether error bars or multiple runs are included, given the performance variability typical in LLM serving benchmarks.
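The request for error bars can be met with very little machinery. The sketch below summarizes repeated benchmark runs with a mean, a sample standard deviation, and an approximate 95% interval for the mean. The latency values are made-up placeholders, `summarize_runs` is a hypothetical helper, and the 1.96 multiplier assumes a normal approximation; with only a handful of runs a t-distribution quantile would be more appropriate.

```python
import statistics


def summarize_runs(samples):
    """Return (mean, sample stdev, 95% normal-approximation half-width of the mean)."""
    if len(samples) < 2:
        raise ValueError("need at least two runs to estimate variability")
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    half_width = 1.96 * stdev / len(samples) ** 0.5
    return mean, stdev, half_width


# Placeholder measurements, not data from the paper.
latencies_ms = [102.0, 98.0, 100.0, 101.0, 99.0]
mean, stdev, hw = summarize_runs(latencies_ms)
print(f"{mean:.1f} ms +/- {hw:.1f} ms (stdev {stdev:.2f}, n={len(latencies_ms)})")
```

Reporting the run count alongside the interval lets readers judge whether a gap between two frameworks exceeds run-to-run noise.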
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comment on the evaluation section below, clarifying the support for our generalizability claims and outlining planned revisions.
-
Referee: [Evaluation] Evaluation section: Results are reported only on four GPU architectures and sixteen models with no ablation isolating the adaptive head alignment or hardware-aware packing logic. This leaves the central claim that the four techniques 'automatically generalize across diverse hardware architectures and precision formats without requiring fragmented hand-tuned kernels' insufficiently supported, as the manuscript provides no additional architectures, cross-precision stress tests, or code-level evidence of genericity.
Authors: We agree that ablation studies isolating the contributions of adaptive head alignment and hardware-aware weight packing would strengthen the evidence for the generalizability claims. In the revised manuscript we will add these ablations, along with expanded discussion of how the hardware-aware pipelines enable adaptation across precision formats without per-hardware kernels. The reported results already show consistent gains (up to 61% lower latency and 156% higher throughput) across 16 models and 4 representative GPU architectures, which were chosen to cover different compute and memory characteristics. The open-sourced code provides direct inspectable evidence of the implementation approach. We will also incorporate additional cross-precision results to the extent space allows.
Revision: yes
Circularity Check
No circularity: empirical benchmarks with no derived predictions or self-referential equations
The paper reports measured latency and throughput improvements from running sixteen LLMs on four GPUs and comparing against existing frameworks. No equations, fitted parameters, or first-principles derivations appear in the abstract or described content; the four techniques are presented as engineering implementations whose benefits are validated by direct experiment rather than by construction from the input data. The central claims therefore remain independent of any self-definition or self-citation chain.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Target GPUs support the described instruction-level parallelism and memory access patterns for the KV loading pipeline.
Forward citations
Cited by 8 Pith papers
-
MPDocBench-Parse: Benchmarking Practical Multi-page Document Parsing
MPDocBench-Parse provides a 3,246-page benchmark and evaluation protocol for multi-page document parsing that tests text/table/formula extraction, merging, figure handling, reading order, and heading hierarchy.
-
HexAGenT: Efficient Agentic LLM Serving via Workflow- and Heterogeneity-Aware Scheduling
HexAGenT reduces the SLO scale required for timely agentic LLM workflow completion by an average of 20.1% at 95% attainment and 33.0% at 99% attainment on heterogeneous A100/H100/H200 clusters.
-
Multi-Scale Dequant: Eliminating Dequantization Bottleneck via Activation Decomposition for Efficient LLM Inference
MSD eliminates dequantization from the GEMM path by decomposing BF16 activations into multiple low-precision parts that multiply directly with INT8 or MXFP4 weights, achieving near-16 effective bits for INT8 and 6.6 f...
-
Autopoiesis: A Self-Evolving System Paradigm for LLM Serving Under Runtime Dynamics
Autopoiesis uses LLM-driven program synthesis to evolve serving policies online during deployment, delivering up to 53% and average 34% gains over prior LLM serving systems under runtime dynamics.
-
The Silent Hyperparameter: Quantifying the Impact of Inference Backends on LLM Reproducibility
Different inference backends alter LLM benchmark scores by up to 16.6 percentage points through optimizations such as prefix caching, CUDA graphs, and custom kernels.
-
The Silent Hyperparameter: Quantifying the Impact of Inference Backends on LLM Reproducibility
Empirical study shows LLM inference backends can shift benchmark scores by up to 16.6 percentage points and cause output disagreements due to optimizations like prefix caching and custom kernels.
-
SceneGraphVLM: Dynamic Scene Graph Generation from Video with Vision-Language Models
SceneGraphVLM generates dynamic scene graphs from video using compact VLMs, TOON serialization, and hallucination-aware RL to improve precision and achieve one-second latency.
-
HexiSeq: Accommodating Long Context Training of LLMs over Heterogeneous Hardware
HexiSeq optimizes sequence and head partitioning across mixed GPUs to improve long-context LLM training throughput by up to 1.72x in simulations.