Valve jointly bounds preemption latency and rate for online-offline LLM colocation on GPUs, delivering 34.6% higher cluster utilization and a 2,170-GPU saving in a production deployment of 8,054 GPUs with under 5% TTFT and 2% TPOT impact.
CUDA Multi-Process Service, 2026a
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.OS 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Valve: Production Online-Offline Inference Colocation with Jointly-Bounded Preemption Latency and Rate
Valve jointly bounds preemption latency and rate for online-offline LLM colocation on GPUs, delivering 34.6% higher cluster utilization and a 2,170-GPU saving in a production deployment of 8,054 GPUs with under 5% TTFT and 2% TPOT impact.