HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices

Bin Jia; Haotian Zhou; Shenggan Cheng; Xuanlei Zhao; Yang You; Ziming Liu

arxiv: 2403.01164 · v1 · pith:WKFOJ2ECnew · submitted 2024-03-02 · 💻 cs.PF · cs.DC

Xuanlei Zhao, Bin Jia, Haotian Zhou, Ziming Liu, Shenggan Cheng, Yang You

classification 💻 cs.PF cs.DC

keywords inference, devices, hetegen, heterogeneous, llms, parallel, bottlenecks, computing

The emergence of Large Language Models (LLMs) has led to increasingly large model sizes, posing challenges for inference on low-resource devices. Prior approaches have explored offloading to enable low-memory inference, but they often suffer from inefficiency due to I/O bottlenecks. To achieve low-latency LLM inference on resource-constrained devices, we introduce HeteGen, a novel approach that presents a principled framework for heterogeneous parallel computing using CPUs and GPUs. Based on this framework, HeteGen further employs heterogeneous parallel computing and asynchronous overlap for LLMs to mitigate I/O bottlenecks. Our experiments demonstrate a substantial improvement in inference speed, surpassing state-of-the-art methods by up to 317%.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips
cs.DC 2026-01 conditional novelty 7.0

SuperInfer improves TTFT SLO attainment by up to 74.7% on GH200 Superchips via SLO-aware rotary scheduling (RotaSched) and full-duplex KV cache rotation (DuplexKV) over NVLink-C2C while preserving TBT and throughput.