Efficient memory management for large language model serving with PagedAttention
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.LG (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Fast Heterogeneous Serving: Scalable Mixed-Scale LLM Allocation for SLO-Constrained Inference
Two constraint-aware greedy heuristics (GH and AGH) solve mixed-scale LLM allocation on heterogeneous GPUs under SLO constraints in under one second with over 260x speedup and near-optimal cost compared to exact MILP.