Pie: Pooling CPU memory for LLM inference. arXiv preprint arXiv:2411.09317

Yi Xu, Ziming Mao, Xiangxi Mo, Shu Liu, Ion Stoica · 2024 · arXiv 2411.09317

5 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

SiDP: Memory-Efficient Data Parallelism for Offline LLM Inference

cs.DC · 2026-05-27 · unverdicted · novelty 7.0

SiDP distributes model weights across a DP group with WaS and CaS modes to increase KV cache capacity by up to 1.8x and end-to-end throughput by up to 1.5x over vLLM on H20/H200/B200 GPUs for offline LLM inference.

SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips

cs.DC · 2026-01-28 · conditional · novelty 7.0

SuperInfer improves TTFT SLO attainment by up to 74.7% on GH200 Superchips via SLO-aware rotary scheduling (RotaSched) and full-duplex KV cache rotation (DuplexKV) over NVLink-C2C while preserving TBT and throughput.

Idleness is Relative: Exploiting Tool-Call Idle Windows for Offloading in Agentic Systems with MORI

cs.OS · 2026-05-30 · unverdicted · novelty 6.0

MORI improves throughput 20-71% and TTFT 18-43% over baselines by ranking programs on a continuous idleness spectrum and shifting the GPU-CPU boundary to match capacity in agentic LLM serving.

C2CServe: Leveraging NVLink-C2C for Elastic Serverless LLM Serving on MIG

cs.OS · 2026-05-19 · unverdicted · novelty 6.0

C2CServe is a request-granularity serverless LLM serving system that keeps weights in host memory and streams them via C2C to MIG instances, cutting cold-start latency up to 7.1x while preserving TTFT/TPOT under contention.

DAK: Direct-Access-Enabled GPU Memory Offloading with Optimal Efficiency for LLM Inference

cs.DC · 2026-04-28 · unverdicted · novelty 6.0

DAK enables direct GPU access to remote memory for LLM inference via TMA repurposing and a greedy offloading algorithm, achieving up to 3x gains over prefetching baselines on NVLink-C2C and 1.8x on PCIe.

citing papers explorer

SiDP: Memory-Efficient Data Parallelism for Offline LLM Inference
cs.DC · 2026-05-27 · unverdicted · none · ref 31
Idleness is Relative: Exploiting Tool-Call Idle Windows for Offloading in Agentic Systems with MORI
cs.OS · 2026-05-30 · unverdicted · none · ref 64
C2CServe: Leveraging NVLink-C2C for Elastic Serverless LLM Serving on MIG
cs.OS · 2026-05-19 · unverdicted · none · ref 43
DAK: Direct-Access-Enabled GPU Memory Offloading with Optimal Efficiency for LLM Inference
cs.DC · 2026-04-28 · unverdicted · none · ref 37
