
arxiv: 2604.07472 · v1 · submitted 2026-04-08 · 💻 cs.LG · cs.NI

Fast Heterogeneous Serving: Scalable Mixed-Scale LLM Allocation for SLO-Constrained Inference

Pith reviewed 2026-05-10 17:57 UTC · model grok-4.3

keywords LLM inference, heterogeneous GPUs, SLO constraints, greedy heuristics, mixed-scale allocation, scalability, GPU provisioning, Azure trace

The pith

Two heuristics allocate mixed-scale LLMs on heterogeneous GPUs in under one second while meeting SLOs and approaching optimal cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Deploying LLMs at scale requires choosing base models, GPU types, parallelism settings, and workload splits under tight limits on latency, accuracy, and spending. Exact mixed-integer solvers guarantee the best answer but become too slow as problem size grows. The paper develops a simple Greedy Heuristic (GH) for one-pass decisions and an Adaptive Greedy Heuristic (AGH) that improves on it with multiple construction attempts, relocation moves, and GPU consolidation. Three targeted checks keep every choice inside the memory, delay, error, and budget limits at once. On workloads calibrated with the Azure LLM Inference Trace, both methods finish in under one second, AGH keeps cost close to the MILP optimum, and it holds up in out-of-sample stress tests with up to 1.5x parameter inflation.
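The one-pass idea can be sketched as follows. This is a minimal illustration, not the paper's GH: the `Config` and `QueryType` records, the reading of "effective coverage" as the demand a configuration can actually absorb, and the per-type loop are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """Hypothetical (model, GPU type, TP degree) bundle."""
    name: str
    cost: float      # rental cost per instance
    capacity: float  # requests/s one instance can serve
    delay: float     # expected latency in seconds
    error: float     # expected error rate


@dataclass(frozen=True)
class QueryType:
    """Hypothetical query class with its demand and SLO thresholds."""
    name: str
    demand: float
    max_delay: float
    max_error: float


def greedy_allocate(query_types, configs, budget):
    """Single pass: per query type, keep buying the instance with the
    lowest cost per unit of demand it would actually cover."""
    plan, spent = [], 0.0
    for q in query_types:
        remaining = q.demand
        # Assumed SLO filter: only configs meeting both delay and error limits.
        feasible = [c for c in configs
                    if c.capacity > 0
                    and c.delay <= q.max_delay
                    and c.error <= q.max_error]
        while remaining > 1e-9:
            affordable = [c for c in feasible if spent + c.cost <= budget]
            if not affordable:
                return None  # no feasible allocation within the budget
            best = min(affordable,
                       key=lambda c: c.cost / min(c.capacity, remaining))
            served = min(best.capacity, remaining)
            plan.append((q.name, best.name, served))
            spent += best.cost
            remaining -= served
    return plan, spent
```

Dividing cost by `min(capacity, remaining)` rather than by raw capacity keeps a large, expensive instance from looking cheap when only a little demand is left to cover.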

Core claim

The Adaptive Greedy Heuristic combines a basic greedy pass with multi-start construction, relocate-based local search, and GPU consolidation, and relies on three mechanisms: TP-aware feasibility selection, cost-per-effective-coverage ranking, and TP upgrade. It yields feasible allocations in under one second that closely match the MILP-optimal cost, and it keeps SLO violations controlled under 1.5x parameter inflation, where the exact solver's placement degrades sharply.

What carries the argument

The Adaptive Greedy Heuristic (AGH) with multi-start construction, relocate-based local search, GPU consolidation, and the three constraint-aware mechanisms of TP-aware feasibility selection, cost-per-effective-coverage ranking, and TP upgrade.

If this is right

  • Real-time reallocation becomes feasible for continuously arriving inference requests.
  • Operational cost stays near the theoretical minimum without long computation delays.
  • Deployments remain stable when parameters drift from their planned values, at least up to the 1.5x inflation tested.
  • Exact solvers can be reserved for small problems while heuristics handle production scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same structure could support online scheduling in cloud systems that adjust GPU pools every few minutes.
  • Similar local-search refinements might transfer to other heterogeneous hardware allocation settings such as mixed CPU-GPU training clusters.
  • Pre-computing a small library of good starting allocations could cut the multi-start overhead even further.

Load-bearing premise

The three constraint-aware mechanisms can always produce feasible allocations when memory, delay, error, and budget constraints are tightly coupled.
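To make the premise concrete, here is one plausible reading of two of the mechanisms, written as a sketch. The page does not define them, so the interpretation is an assumption: "TP-aware feasibility selection" is taken to mean picking the smallest tensor-parallel (TP) degree whose per-GPU shard fits in memory, and "TP upgrade" to mean raising that degree until the delay SLO holds. The function names, the even split of weights across GPUs, and the `delay_at` callback are hypothetical.

```python
def min_feasible_tp(weights_gb, kv_gb, gpu_mem_gb, tp_degrees=(1, 2, 4, 8)):
    """Smallest TP degree whose per-GPU footprint fits in GPU memory.

    Assumes weights split evenly across TP ranks and a fixed per-GPU
    KV-cache reservation; returns None if no degree fits.
    """
    for tp in sorted(tp_degrees):
        if weights_gb / tp + kv_gb <= gpu_mem_gb:
            return tp
    return None


def upgrade_tp(tp, delay_at, max_delay, tp_degrees=(1, 2, 4, 8)):
    """Raise TP from the memory-feasible minimum until the delay SLO is met.

    `delay_at` is a caller-supplied (hypothetical) latency model mapping a
    TP degree to expected delay; returns None if no degree meets the SLO.
    """
    for candidate in sorted(d for d in tp_degrees if d >= tp):
        if delay_at(candidate) <= max_delay:
            return candidate
    return None
```

With illustrative numbers (140 GB of weights, a 10 GB KV reservation, 80 GB GPUs), `min_feasible_tp` picks TP=2. The premise fails exactly when `upgrade_tp` returns `None`, or when the upgraded degree no longer fits the budget, which is the coupling the premise is about.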

What would settle it

A workload instance where the heuristics output an allocation that violates SLOs or budget while the exact MILP solver finds a feasible lower-cost solution, or where either heuristic takes more than one second on the paper's large-scale instances.
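The test described above can be phrased as a small audit routine. This is a sketch under stated assumptions: allocations are summarized as dictionaries with hypothetical keys `memory`, `delay`, `error`, and `cost`, each compared against an upper limit, and the one-second threshold is the figure quoted on this page.

```python
import time

CONSTRAINTS = ("memory", "delay", "error", "cost")  # hypothetical summary keys


def violations(alloc, limits):
    """Names of the constraints whose limit the allocation exceeds."""
    return [key for key in CONSTRAINTS if alloc[key] > limits[key]]


def timed(solver, *args):
    """Run a solver and return (result, wall-clock seconds)."""
    start = time.perf_counter()
    result = solver(*args)
    return result, time.perf_counter() - start


def is_counterexample(heur_alloc, heur_seconds, milp_alloc, limits,
                      time_limit=1.0):
    """True if the instance would settle the claim against the heuristics."""
    if heur_seconds > time_limit:
        return True  # the sub-second claim fails on this instance
    if heur_alloc is None or milp_alloc is None:
        return False  # nothing to compare
    heuristic_violates = bool(violations(heur_alloc, limits))
    milp_feasible = not violations(milp_alloc, limits)
    milp_cheaper = milp_alloc["cost"] < heur_alloc["cost"]
    return heuristic_violates and milp_feasible and milp_cheaper
```

A harness would wrap each heuristic call in `timed(...)` and pass the measured seconds to `is_counterexample` alongside the exact solver's allocation for the same instance.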

Figures

Figures reproduced from arXiv: 2604.07472 by Duong Tung Nguyen, Jiaming Cheng.

Figure 1. System model: users submit queries classified by type, routed to heterogeneous GPU tiers hosting foundation models under TP/PP configurations.
… et al. [12] study cost-efficient serving with heterogeneous VMs and KV cache offloading, SeaLLM [9] enables multi-LLM resource sharing, and SkyLB [11] proposes locality-aware cross-region load balancing. These works demonstrate the importance of heterogeneity-aware …
Figure 2. (a)–(c): Model comparison under delay/error stress. (d)–(f): AGH sensitivity analysis over SLO thresholds ∆i, ϵi, budget δ, and rental cost p c
Original abstract

Deploying large language model (LLM) inference at scale requires jointly selecting base models, provisioning heterogeneous GPUs, configuring parallelism, and distributing workloads under tight latency, accuracy, and budget constraints. Exact mixed-integer linear programming (MILP) approaches guarantee optimality but scale poorly. We propose two constraint-aware heuristics: a Greedy Heuristic (GH) for single-pass allocation, and an Adaptive Greedy Heuristic (AGH) that enhances GH via multi-start construction, relocate-based local search, and GPU consolidation. Three constraint-aware mechanisms -- TP-aware feasibility selection, cost-per-effective-coverage ranking, and TP upgrade -- ensure feasibility under tightly coupled memory, delay, error, and budget constraints. On workloads calibrated with the Azure LLM Inference Trace (2025), both heuristics produce feasible solutions in under one second, with AGH closely approaching optimal cost while achieving over 260x speedup on large-scale instances. Under out-of-sample stress tests with up to 1.5x parameter inflation, AGH maintains controlled SLO violations and stable cost, whereas the exact solver's placement degrades sharply.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript proposes two constraint-aware heuristics—a single-pass Greedy Heuristic (GH) and an Adaptive Greedy Heuristic (AGH) that adds multi-start construction, relocate-based local search, and GPU consolidation—for jointly selecting base models, heterogeneous GPUs, parallelism degrees, and workload distributions under coupled memory, latency, accuracy, and budget constraints. On workloads derived from the Azure LLM Inference Trace (2025), both heuristics return feasible solutions in under one second; AGH approaches the cost of an exact MILP solver while delivering >260× speedup on large instances. Under out-of-sample stress tests with up to 1.5× parameter inflation, AGH keeps SLO violations controlled and cost stable, whereas the exact solver degrades.

Significance. If the reported speedups and robustness hold, the work would be significant for practical LLM serving systems: it shows that carefully designed, constraint-aware heuristics can make mixed-integer allocation tractable at scale without sacrificing feasibility or solution quality. The explicit use of a public trace for calibration plus controlled out-of-sample inflation provides a reproducible empirical foundation that is stronger than purely synthetic evaluations.

minor comments (3)
  1. [Abstract] The performance claims (sub-second runtimes, 260× speedup, controlled violations) are stated without any reference to the methods section or to the three named mechanisms (TP-aware feasibility selection, cost-per-effective-coverage ranking, TP upgrade); a single sentence summarizing how these mechanisms enforce feasibility would make the abstract self-contained.
  2. [Evaluation] While the text describes success on the Azure trace and 1.5× inflation tests, no summary table or figure reports the exact cost ratios, violation counts, or runtime distributions across instance sizes; adding such a table would allow readers to assess the “closely approaching optimal” claim quantitatively.
  3. [Methods] The manuscript would benefit from a short pseudocode listing or algorithmic sketch of AGH (multi-start + relocate + consolidation) to complement the prose description of the three constraint-aware mechanisms.
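As an indication of what such a sketch might look like, the following toy shows the three AGH stages on a simplified packing problem. It is not the authors' algorithm: the GPU catalogue, the scalar per-workload loads (assumed positive), the sum-of-squares acceptance rule in `relocate`, and the whole-GPU merge in `consolidate` are stand-ins chosen for the example, and the SLO and budget checks are omitted.

```python
import random

# Hypothetical catalogue: type -> (capacity, rental cost).
GPU_TYPES = {"small": (10, 1.0), "large": (30, 2.5)}


def cost(bins):
    return sum(GPU_TYPES[b["type"]][1] for b in bins)


def load(b):
    return sum(b["items"])


def slack(b):
    return GPU_TYPES[b["type"]][0] - load(b)


def construct(loads, rng):
    """Randomized greedy start: shuffle, first-fit, open cheapest fitting type."""
    order = list(loads)
    rng.shuffle(order)
    bins = []
    for item in order:
        target = next((b for b in bins if slack(b) >= item), None)
        if target is None:
            fitting = [t for t, (cap, _) in GPU_TYPES.items() if cap >= item]
            if not fitting:
                return None  # item fits on no GPU type
            kind = min(fitting, key=lambda t: GPU_TYPES[t][1])
            target = {"type": kind, "items": []}
            bins.append(target)
        target["items"].append(item)
    return bins


def relocate(bins):
    """Move single items toward fuller GPUs; drop GPUs that become empty."""
    improved = True
    while improved:
        improved = False
        for src in bins:
            for item in src["items"]:
                dst = next((b for b in bins if b is not src
                            and slack(b) >= item
                            and load(b) + item > load(src)), None)
                if dst is None:
                    continue
                src["items"].remove(item)
                dst["items"].append(item)
                if not src["items"]:
                    bins[:] = [b for b in bins if b is not src]
                improved = True
                break
            if improved:
                break
    return bins


def consolidate(bins):
    """Merge one GPU's whole load into another GPU's slack and retire it."""
    merged = True
    while merged:
        merged = False
        for src in sorted(bins, key=load):
            dst = next((b for b in bins
                        if b is not src and slack(b) >= load(src)), None)
            if dst is not None:
                dst["items"].extend(src["items"])
                bins[:] = [b for b in bins if b is not src]
                merged = True
                break
    return bins


def agh(loads, starts=20, seed=0):
    """Multi-start wrapper: keep the cheapest refined allocation."""
    rng = random.Random(seed)
    best = None
    for _ in range(starts):
        bins = construct(loads, rng)
        if bins is None:
            continue
        bins = consolidate(relocate(bins))
        if best is None or cost(bins) < cost(best):
            best = bins
    return best
```

Neither refinement step can raise cost in this toy, since both only ever retire GPUs, so each start ends no worse than its greedy construction. Whether the paper's relocate and consolidation moves share that property is not stated on this page.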

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of our work on the GH and AGH heuristics for mixed-scale LLM allocation under SLO constraints, as well as for highlighting the significance of the speedups, feasibility guarantees, and reproducible evaluation on the Azure trace with out-of-sample stress tests. The recommendation for minor revision is noted; we will incorporate any editorial or minor clarifications in the revised version.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents two new constraint-aware heuristics (GH and AGH) for mixed-scale LLM allocation under SLO constraints and evaluates them empirically against an exact MILP solver on the Azure LLM Inference Trace (2025) plus out-of-sample stress tests. The reported results (feasibility in <1s, 260x speedup, cost proximity, controlled SLO violations) are direct measurements of algorithm runtime and solution quality on external trace data; no equations reduce these outcomes to fitted parameters defined by the same data, no self-citations bear the central claim, and no ansatz or uniqueness theorem is invoked to force the result. The derivation chain consists of algorithmic construction plus benchmark comparison and is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the representativeness of the Azure trace for calibration and the assumption that the three named mechanisms suffice for feasibility; no free parameters, new mathematical axioms, or invented entities are introduced.

axioms (1)
  • domain assumption The Azure LLM Inference Trace (2025) provides representative workloads for evaluating allocation heuristics under realistic request patterns.
    Workloads are calibrated with this trace; the claim of controlled SLO violations under stress tests depends on it being a fair proxy for production.



Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 1 internal anchor

  1. [1]

    Islam, and Shaolei Ren

    A. Chien, L. Fan, and H. Yeung, “Reducing the carbon impact of generative AI inference (today and in 2035),” arXiv preprint arXiv:2304.03271, 2023

  2. [2]

    Efficient memory management for large language model serving with PagedAttention,

    W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C.H. Yu, J. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory management for large language model serving with PagedAttention,” in Proc. SOSP, 2023

  3. [3]

    DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving,

    Y. Zhong, S. Liu, J. Chen, J. Hu, Y. Zhu, X. Liu, X. Jin, and H. Zhang, “DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving,” in Proc. OSDI, 2024

  4. [4]

    DynamoLLM: Designing LLM inference clusters for performance and energy efficiency,

    J. Stojkovic, C. Zhang, İ. Goiri, J. Torrellas, and E. Choukse, “DynamoLLM: Designing LLM inference clusters for performance and energy efficiency,” in Proc. HPCA, IEEE, 2025

  5. [5]

    Helix: Serving large language models over heterogeneous GPUs and network via max-flow,

    Y. Mei, Y. Zhuang, X. Miao, J. Yang, Z. Jia, and R. Vinayak, “Helix: Serving large language models over heterogeneous GPUs and network via max-flow,” in Proc. ASPLOS, 2025

  6. [6]

    Demystifying cost-efficiency in LLM serving over heterogeneous GPUs,

    Y. Jiang, F. Fu, X. Yao, G. He, X. Miao, A. Klimovic, B. Cui, B. Yuan, and E. Yoneki, “Demystifying cost-efficiency in LLM serving over heterogeneous GPUs,” in Proc. ICML, 2025

  7. [7]

    Greedy randomized adaptive search procedures,

    T. A. Feo and M. G. C. Resende, “Greedy randomized adaptive search procedures,” J. Global Optim., vol. 6, pp. 109–133, 1995

  8. [8]

    Gurobi optimizer reference manual,

    Gurobi Optimization, LLC, “Gurobi optimizer reference manual,” 2024. [Online]. Available: https://www.gurobi.com

  9. [9]

    SeaLLM: Resource sharing for multi-LLM services,

    Y. Zhao, J. Chen, P. Sun, L. Li, X. Liu, and X. Jin, “SeaLLM: Resource sharing for multi-LLM services,” in Proc. NSDI, 2025

  10. [10]

    Offline energy-optimal LLM serving: Workload-based energy models for LLM inference on heterogeneous systems,

    G. Wilkins, S. Keshav, and R. Mortier, “Offline energy-optimal LLM serving,” arXiv preprint arXiv:2407.04014, 2024

  11. [11]

    SkyLB: Locality-aware cross-region load balancing for LLM serving,

    T. Xia, Z. Mao, J. Kerney, E.J. Jackson, Z. Li, J. Xing, S. Shenker, and I. Stoica, “SkyLB: Locality-aware cross-region load balancing for LLM serving,” in Proc. SIGCOMM, 2025

  12. [12]

    Cost-efficient LLM serving with heterogeneous VMs and KV cache offloading,

    K. Kim, et al., “Cost-efficient LLM serving with heterogeneous VMs and KV cache offloading,” in Proc. EuroSys, 2025

  13. [13]

    Azure LLM inference trace,

    Microsoft Research, “Azure LLM inference trace,” 2025. [Online]. Available: https://github.com/Azure/AzurePublicDataset

  14. [14]

    The Llama 3 Herd of Models

    A. Dubey, A. Jauhri, A. Pandey, et al., “The Llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024

  15. [15]

    Efficiently scaling transformer inference,

    R. Pope, S. Douglas, A. Chowdhery, J. Devlin, J. Bradbury, J. Heek, K. Xiao, S. Agrawal, and J. Dean, “Efficiently scaling transformer inference,” in Proc. MLSys, vol. 5, 2023

  16. [16]

    GPTQ: Accurate post-training quantization for generative pre-trained transformers,

    E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, “GPTQ: Accurate post-training quantization for generative pre-trained transformers,” in Proc. ICLR, 2023

  17. [17]

    Splitwise: Efficient generative LLM inference using phase splitting,

    P. Patel, et al., “Splitwise: Efficient generative LLM inference using phase splitting,” in Proc. ISCA, 2024

  18. [18]

    Efficient large-scale language model training on GPU clusters using Megatron-LM,

    D. Narayanan, et al., “Efficient large-scale language model training on GPU clusters using Megatron-LM,” in Proc. SC, 2021