Efficient memory management for large language model serving with PagedAttention

· 2023

1 Pith paper cites this work. Polarity classification is still being indexed.


representative citing papers

Fast Heterogeneous Serving: Scalable Mixed-Scale LLM Allocation for SLO-Constrained Inference

cs.LG · 2026-04-08 · unverdicted · novelty 6.0

Two constraint-aware greedy heuristics (GH and AGH) solve mixed-scale LLM allocation on heterogeneous GPUs under SLO constraints in under one second with over 260x speedup and near-optimal cost compared to exact MILP.

citing papers explorer


Fast Heterogeneous Serving: Scalable Mixed-Scale LLM Allocation for SLO-Constrained Inference · cs.LG · 2026-04-08 · unverdicted · none · ref 2
