DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale

Ammar Ahmad Awan; Cheng Li; Du Li; Elton Zheng; Jeff Rasley; Minjia Zhang; Olatunji Ruwase; Reza Yazdani Aminabadi; Samyam Rajbhandari; Shaden Smith

arxiv: 2207.00032 · v1 · pith:WEVYLKJ2new · submitted 2022-06-30 · 💻 cs.LG · cs.DC · cs.PF

Reza Yazdani Aminabadi, Samyam Rajbhandari, Minjia Zhang, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Jeff Rasley, Shaden Smith, Olatunji Ruwase, Yuxiong He

classification 💻 cs.LG cs.DC cs.PF

keywords inference models memory transformer deepspeed scale scenarios throughput

The past several years have witnessed the success of transformer-based models, and their scale and application scenarios continue to grow aggressively. The current landscape of transformer models is increasingly diverse: the model size varies drastically with the largest being of hundred-billion parameters; the model characteristics differ due to the sparsity introduced by the Mixture-of-Experts; the target application scenarios can be latency-critical or throughput-oriented; the deployment hardware could be single- or multi-GPU systems with different types of memory and storage, etc. With such increasing diversity and the fast-evolving pace of transformer models, designing a highly performant and efficient inference system is extremely challenging. In this paper, we present DeepSpeed Inference, a comprehensive system solution for transformer model inference to address the above-mentioned challenges. DeepSpeed Inference consists of (1) a multi-GPU inference solution to minimize latency while maximizing the throughput of both dense and sparse transformer models when they fit in aggregate GPU memory, and (2) a heterogeneous inference solution that leverages CPU and NVMe memory in addition to the GPU memory and compute to enable high inference throughput with large models which do not fit in aggregate GPU memory. DeepSpeed Inference reduces latency by up to 7.3x over the state-of-the-art for latency-oriented scenarios and increases throughput by over 1.5x for throughput-oriented scenarios. Moreover, it enables trillion parameter scale inference under real-time latency constraints by leveraging hundreds of GPUs, an unprecedented scale for inference. It can inference 25x larger models than with GPU-only solutions, while delivering a high throughput of 84 TFLOPS (over 50% of A6000 peak).

This paper has not been read by Pith yet.

Forward citations

Cited by 13 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding
cs.CV 2026-04 unverdicted novelty 7.0

Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.

Efficient Memory Management for Large Language Model Serving with PagedAttention
cs.LG 2023-09 conditional novelty 7.0

PagedAttention achieves near-zero waste in LLM key-value cache memory and enables 2-4x higher serving throughput than prior systems.

AoiZora: Topology-Aware Auto-Parallel Optimization for Inference of Diffusion Transformers
cs.DC 2026-06 unverdicted novelty 6.0

AoiZora adds topology-aware physical placement planning to auto-parallel compilation for diffusion transformer inference, cutting one-step denoising latency by up to 1.42x on TPU v5e sub-slices.

A Paired Testing Protocol for Batch-Conditioned Refusal Robustness in LLM Serving
cs.LG 2026-05 unverdicted novelty 6.0

The paper introduces a paired testing protocol for batch-conditioned refusal robustness in LLM serving and reports low rates of genuine safety-label flips after adjudication, with a batch-invariant kernel ablation eli...

A Readiness-Driven Runtime for Pipeline-Parallel Training under Runtime Variability
cs.DC 2026-05 unverdicted novelty 6.0

RRFP introduces a readiness-driven runtime for pipeline parallelism that uses schedules as hints and ready-set arbitration to improve utilization under runtime variability, reporting up to 2.77x speedup on multimodal ...

ShardTensor: Domain Parallelism for Scientific Machine Learning
cs.DC 2026-05 unverdicted novelty 6.0

ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.

Predict-then-Diffuse: Adaptive Response Length for Compute-Budgeted Inference in Diffusion LLMs
cs.LG 2026-05 unverdicted novelty 6.0

Predict-then-Diffuse predicts response lengths for diffusion LLMs via an auxiliary model and safety buffer to reduce FLOP waste while preserving output quality.

SweetSpot: An Analytical Model for Predicting Energy Efficiency of LLM Inference
cs.AI 2026-02 unverdicted novelty 6.0

SweetSpot is an analytical model from Transformer computational and memory complexity that identifies energy minima at short-to-moderate inputs and medium outputs, achieving 1.79% MAPE on H100 GPU measurements across ...

H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
cs.LG 2023-06 unverdicted novelty 6.0

H2O evicts non-heavy-hitter tokens from the KV cache using a dynamic submodular policy, retaining recent and frequent-co-occurrence tokens to reduce memory while preserving accuracy.

GSRQ: Gain-Shape Residual Quantization for Sub-1-bit KV Cache
cs.LG 2026-07 unverdicted novelty 5.0

GSRQ applies a gain-shape variant of K-means inside residual quantization to improve directional fidelity, raising LongBench accuracy from 11.34 to 33.54 at 1-bit on LLaMA-3-8B.

Predict-then-Diffuse: Adaptive Response Length for Compute-Budgeted Inference in Diffusion LLMs
cs.LG 2026-05 unverdicted novelty 5.0

Predict-then-Diffuse predicts response length for diffusion LLMs before inference, cutting FLOPs with a data-driven safety buffer while preserving output quality.

Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference
cs.AR 2025-09 unverdicted novelty 5.0

PLENA introduces a co-designed system with three optimization pathways for long-context agentic LLM inference, claiming up to 2.23x throughput over A100 and 4.04x energy efficiency.

Optimizing Teacher-Student Partitioning for Scalable Knowledge Distillation on HPC Systems
cs.DC 2026-06 unverdicted novelty 3.0

The paper introduces an HPC-aware teacher-student partitioning strategy for knowledge distillation that combines vertical and horizontal splits and reports up to 67% higher throughput than the symmetric TRL baseline.