Efficient Mixed-Precision Large Language Model Inference with TurboMind

2025 · cs.DC · arXiv 2508.15601

7 Pith papers cite this work. Polarity classification is still indexing.



abstract

Mixed-precision inference techniques reduce the memory and computational demands of Large Language Models (LLMs) by applying hybrid precision formats to model weights, activations, and KV caches. However, existing systems struggle to (i) automatically generalize across diverse hardware architectures and precision formats, often requiring fragmented, hand-tuned kernels, and (ii) fully exploit available memory and compute resources, often causing performance bottlenecks. To address these problems, we propose TurboMind, a generalizable and efficient mixed-precision LLM inference engine of LMDeploy. TurboMind is built around two hardware-aware mixed-precision pipelines: a General Matrix Multiply (GEMM) pipeline that optimizes matrix operations through offline weight packing and online acceleration, and an attention pipeline that enables efficient attention computation with different Query, Key, and Value precision combinations. These pipelines are enabled by four key techniques: (i) hardware-aware weight packing and (ii) adaptive head alignment for generalizability, and (iii) instruction-level parallelism and (iv) a KV memory loading pipeline for efficiency. We conduct comprehensive evaluations of LMDeploy powered by TurboMind across sixteen popular LLMs and four representative GPU architectures. Results demonstrate that LMDeploy achieves up to 61% lower serving latency (30% on average) and up to 156% higher throughput (58% on average) in mixed-precision workloads compared to existing mixed-precision frameworks, establishing consistent performance improvements across all tested configurations and hardware types. This work is open-sourced and publicly available at https://github.com/InternLM/lmdeploy.
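To make the idea of "offline weight packing and online acceleration" concrete, the following is a minimal sketch in plain Python, not TurboMind's actual kernels or memory layout. It assumes a simple per-row 4-bit scheme with two values per byte; the function names (`pack_int4`, `unpack_int4`, `quantize_row`, `mixed_precision_matmul`) and the scale/offset format are hypothetical and chosen only for illustration. Weights are quantized and packed once ahead of time, then dequantized on the fly while multiplying against full-precision activations.

```python
# Illustrative sketch only: a toy 4-bit weight-packing scheme, not the
# layout or kernels used by TurboMind. All names here are hypothetical.

def pack_int4(values):
    """Pack unsigned 4-bit integers (0..15) two per byte, low nibble first."""
    if len(values) % 2:
        raise ValueError("expected an even number of 4-bit values")
    if any(not 0 <= v <= 15 for v in values):
        raise ValueError("values must fit in 4 bits")
    return bytes(values[i] | (values[i + 1] << 4)
                 for i in range(0, len(values), 2))


def unpack_int4(packed):
    """Inverse of pack_int4: recover the list of 4-bit integers."""
    out = []
    for byte in packed:
        out.append(byte & 0x0F)   # low nibble was stored first
        out.append(byte >> 4)     # high nibble second
    return out


def quantize_row(row):
    """Offline step: quantize one weight row so that w ~= lo + scale * q."""
    lo, hi = min(row), max(row)
    scale = (hi - lo) / 15 if hi > lo else 1.0
    q = [round((w - lo) / scale) for w in row]
    return pack_int4(q), scale, lo


def mixed_precision_matmul(x, packed_rows):
    """Online step: y = x @ W^T with W stored packed, dequantized per row."""
    out = []
    for xi in x:
        y_row = []
        for packed, scale, lo in packed_rows:
            q = unpack_int4(packed)
            y_row.append(sum(a * (lo + scale * v) for a, v in zip(xi, q)))
        out.append(y_row)
    return out


if __name__ == "__main__":
    weights = [[-1.0, 0.5, 0.25, 1.0], [0.0, 0.1, 0.2, 0.3]]
    packed_rows = [quantize_row(r) for r in weights]   # done once, offline
    activations = [[1.0, 2.0, 3.0, 4.0]]               # kept in full precision
    print(mixed_precision_matmul(activations, packed_rows))
```

In this toy version each packed row takes half a byte per weight plus one scale and one offset, and the result differs from the full-precision product by at most half a quantization step per weight. A real engine would additionally arrange the packed data to match the target GPU's tensor-core instructions, which is the hardware-aware part the abstract refers to and which this sketch does not attempt.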

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

MPDocBench-Parse: Benchmarking Practical Multi-page Document Parsing

cs.AI · 2026-05-21 · unverdicted · novelty 7.0

MPDocBench-Parse provides a 3,246-page benchmark and evaluation protocol for multi-page document parsing that tests text/table/formula extraction, merging, figure handling, reading order, and heading hierarchy.

HexAGenT: Efficient Agentic LLM Serving via Workflow- and Heterogeneity-Aware Scheduling

cs.DC · 2026-05-15 · unverdicted · novelty 7.0

HexAGenT reduces the SLO scale required for timely agentic LLM workflow completion by an average of 20.1% at 95% attainment and 33.0% at 99% attainment on heterogeneous A100/H100/H200 clusters.

Multi-Scale Dequant: Eliminating Dequantization Bottleneck via Activation Decomposition for Efficient LLM Inference

stat.ML · 2026-05-13 · unverdicted · novelty 7.0

MSD eliminates dequantization from the GEMM path by decomposing BF16 activations into multiple low-precision parts that multiply directly with INT8 or MXFP4 weights, achieving near-16 effective bits for INT8 and 6.6 for MXFP4 with reduced HBM traffic.

Autopoiesis: A Self-Evolving System Paradigm for LLM Serving Under Runtime Dynamics

cs.DC · 2026-04-08 · unverdicted · novelty 7.0

Autopoiesis uses LLM-driven program synthesis to evolve serving policies online during deployment, delivering up to 53% and average 34% gains over prior LLM serving systems under runtime dynamics.

The Silent Hyperparameter: Quantifying the Impact of Inference Backends on LLM Reproducibility

cs.LG · 2026-05-19 · unverdicted · novelty 6.0 · 2 refs

Empirical study shows LLM inference backends can shift benchmark scores by up to 16.6 percentage points and cause output disagreements due to optimizations like prefix caching and custom kernels.

SceneGraphVLM: Dynamic Scene Graph Generation from Video with Vision-Language Models

cs.CV · 2026-05-13 · unverdicted · novelty 6.0

SceneGraphVLM generates dynamic scene graphs from video using compact VLMs, TOON serialization, and hallucination-aware RL to improve precision and achieve one-second latency.

HexiSeq: Accommodating Long Context Training of LLMs over Heterogeneous Hardware

cs.DC · 2026-05-08 · unverdicted · novelty 6.0

HexiSeq optimizes sequence and head partitioning across mixed GPUs to improve long-context LLM training throughput by up to 1.72x in simulations.

citing papers explorer

Showing 7 of 7 citing papers.

MPDocBench-Parse: Benchmarking Practical Multi-page Document Parsing
cs.AI · 2026-05-21 · unverdicted · none · ref 68 · internal anchor

HexAGenT: Efficient Agentic LLM Serving via Workflow- and Heterogeneity-Aware Scheduling
cs.DC · 2026-05-15 · unverdicted · none · ref 53 · internal anchor

Multi-Scale Dequant: Eliminating Dequantization Bottleneck via Activation Decomposition for Efficient LLM Inference
stat.ML · 2026-05-13 · unverdicted · none · ref 23

Autopoiesis: A Self-Evolving System Paradigm for LLM Serving Under Runtime Dynamics
cs.DC · 2026-04-08 · unverdicted · none · ref 44

The Silent Hyperparameter: Quantifying the Impact of Inference Backends on LLM Reproducibility
cs.LG · 2026-05-19 · unverdicted · none · ref 20 · 2 links · internal anchor

SceneGraphVLM: Dynamic Scene Graph Generation from Video with Vision-Language Models
cs.CV · 2026-05-13 · unverdicted · none · ref 47

HexiSeq: Accommodating Long Context Training of LLMs over Heterogeneous Hardware
cs.DC · 2026-05-08 · unverdicted · none · ref 60
