Valve jointly bounds preemption latency and rate for online-offline LLM colocation on GPUs, delivering 34.6% higher cluster utilization and a 2,170-GPU saving in a production deployment of 8,054 GPUs with under 5% TTFT and 2% TPOT impact.
HybridFlow: A flexible and efficient RLHF framework. EuroSys 2025 (30/03/2025-03/04/2025, Rotterdam)
1 Pith paper cites this work. Polarity classification is still being indexed.
Fields: cs.OS (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Valve: Production Online-Offline Inference Colocation with Jointly-Bounded Preemption Latency and Rate