Pie: Pooling CPU Memory for LLM Inference

Xu, Y · 2024 · arXiv 2411.09317

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips

cs.DC · 2026-01-28 · conditional · novelty 7.0

SuperInfer improves TTFT SLO attainment by up to 74.7% on GH200 Superchips via SLO-aware rotary scheduling (RotaSched) and full-duplex KV cache rotation (DuplexKV) over NVLink-C2C while preserving TBT and throughput.

C2CServe: Leveraging NVLink-C2C for Elastic Serverless LLM Serving on MIG

cs.OS · 2026-05-19 · unverdicted · novelty 6.0

C2CServe is a request-granularity serverless LLM serving system that keeps weights in host memory and streams them via C2C to MIG instances, cutting cold-start latency up to 7.1x while preserving TTFT/TPOT under contention.

DAK: Direct-Access-Enabled GPU Memory Offloading with Optimal Efficiency for LLM Inference

cs.DC · 2026-04-28 · unverdicted · novelty 6.0

DAK enables direct GPU access to remote memory for LLM inference via TMA repurposing and a greedy offloading algorithm, achieving up to 3x gains over prefetching baselines on NVLink-C2C and 1.8x on PCIe.

citing papers explorer

SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips · cs.DC · 2026-01-28 · conditional · none · ref 21
C2CServe: Leveraging NVLink-C2C for Elastic Serverless LLM Serving on MIG · cs.OS · 2026-05-19 · unverdicted · none · ref 43
DAK: Direct-Access-Enabled GPU Memory Offloading with Optimal Efficiency for LLM Inference · cs.DC · 2026-04-28 · unverdicted · none · ref 37
