BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching

Baris Kasikci; Ion Stoica; Jiarong Xing; Kan Zhu; Lianmin Zheng; Shuo Yang; Yang Zhou; Yilong Zhao

arxiv: 2411.16102 · v2 · pith:LJSVDZREnew · submitted 2024-11-25 · 💻 cs.LG

Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yang Zhou, Jiarong Xing, Ion Stoica

classification: cs.LG

keywords: inference, resource, blendserve, offline, prefix, throughput, batch, maximizes

Offline batch inference, which leverages the flexibility of request batching to achieve higher throughput and lower costs, is becoming more popular for latency-insensitive applications. Meanwhile, recent progress in model capability and modality makes requests more diverse in compute and memory demands, creating unique opportunities for throughput improvement by resource overlapping. However, a request schedule that maximizes resource overlapping can conflict with the schedule that maximizes prefix sharing, a widely-used performance optimization, causing sub-optimal inference throughput. We present BlendServe, a system that maximizes resource utilization of offline batch inference by combining the benefits of resource overlapping and prefix sharing using a resource-aware prefix tree. BlendServe exploits the relaxed latency requirements in offline batch inference to reorder and overlap requests with varied resource demands while ensuring high prefix sharing. We evaluate BlendServe on a variety of synthetic multi-modal workloads and show that it provides up to $1.44\times$ throughput boost compared to widely-used industry standards, vLLM and SGLang.
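
The resource-overlapping idea in the abstract can be illustrated with a toy scheduler. The sketch below is not BlendServe's algorithm and does not implement its resource-aware prefix tree; it only shows, under simplifying assumptions, how reordering requests can balance compute-heavy and memory-heavy work within each batch. The `Request` type, the `compute_share` proxy (the fraction of a request's tokens that are prompt tokens), and the `blend` function are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    prompt_tokens: int   # prefill work, assumed here to be compute-heavy
    output_tokens: int   # decode work, assumed here to be memory-heavy


def compute_share(r: Request) -> float:
    """Toy proxy (an assumption): fraction of the request's tokens that are prompt tokens."""
    return r.prompt_tokens / (r.prompt_tokens + r.output_tokens)


def blend(requests, batch_size=2):
    """Build batches by alternately taking the most compute-heavy and the
    most memory-heavy remaining request, so each batch mixes both kinds."""
    ordered = sorted(requests, key=compute_share)
    lo, hi = 0, len(ordered) - 1
    batches = []
    while lo <= hi:
        batch = []
        while len(batch) < batch_size and lo <= hi:
            if len(batch) % 2 == 0:
                batch.append(ordered[hi])  # most compute-heavy left
                hi -= 1
            else:
                batch.append(ordered[lo])  # most memory-heavy left
                lo += 1
        batches.append(batch)
    return batches


if __name__ == "__main__":
    reqs = [Request(900, 100), Request(800, 200),
            Request(100, 900), Request(200, 800)]
    for batch in blend(reqs):
        mean = sum(compute_share(r) for r in batch) / len(batch)
        print(batch, f"mean compute share = {mean:.2f}")
```

A schedule like this ignores prompt contents, so requests that share a prefix may land in different batches. That is the conflict the abstract describes: ordering purely for resource balance can give up prefix sharing, and ordering purely for prefix sharing can group requests with similar resource demands together.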

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MoE-Prefill: Zero Redundancy Overheads in MoE Prefill Serving
cs.LG · 2026-05 · unverdicted · novelty 7.0

MoE-Prefill achieves 1.35-1.59x higher throughput for prefill-only MoE serving by using asynchronous expert parallelism to overlap weight AllGather with computation and prefix-aware routing with true-FLOPs tracking.

MoE-Prefill: Zero Redundancy Overheads in MoE Prefill Serving
cs.LG · 2026-05 · unverdicted · novelty 6.0

ZeRO-Prefill achieves 1.35-1.59x higher throughput for MoE prefill serving by replacing per-layer activation AllToAll with overlapped asynchronous weight AllGather and prefix-aware routing.

PipeMax: Enhancing Offline LLM Inference on Commodity GPU Servers
cs.DC · 2026-05 · unverdicted · novelty 5.0

PipeMax integrates pipeline parallelism with offloading to achieve up to 2.51x higher throughput than vLLM for offline LLM inference on commodity 8-GPU servers.