
arxiv: 2403.01164 · v1 · pith:WKFOJ2ECnew · submitted 2024-03-02 · 💻 cs.PF · cs.DC

HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices

classification 💻 cs.PF cs.DC
keywords inference devices hetegen heterogeneous llms parallel bottlenecks computing

The recent emergence of Large Language Models (LLMs) has brought ever-larger model sizes, which makes inference on low-resource devices challenging. Prior approaches have explored offloading to enable low-memory inference, but they are often inefficient because of I/O bottlenecks. To achieve low-latency LLM inference on resource-constrained devices, we introduce HeteGen, a novel approach that presents a principled framework for heterogeneous parallel computing using CPUs and GPUs. Based on this framework, HeteGen further employs heterogeneous parallel computing and asynchronous overlap for LLMs to mitigate I/O bottlenecks. Our experiments demonstrate a substantial improvement in inference speed, surpassing state-of-the-art methods by up to 317%.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips

    cs.DC 2026-01 conditional novelty 7.0

    SuperInfer improves TTFT SLO attainment by up to 74.7% on GH200 Superchips via SLO-aware rotary scheduling (RotaSched) and full-duplex KV cache rotation (DuplexKV) over NVLink-C2C while preserving TBT and throughput.