pith. sign in

arxiv: 2604.07472 · v2 · pith:G2HV7X5Snew · submitted 2026-04-08 · 💻 cs.LG · cs.NI

Scalable Joint Resource Allocation for SLO-Constrained LLM Inference in Heterogeneous GPU Clouds

Pith reviewed 2026-05-10 17:57 UTC · model grok-4.3

classification 💻 cs.LG cs.NI
keywords LLM inferenceheterogeneous GPUsSLO constraintsgreedy heuristicsmixed-scale allocationscalabilityGPU provisioningAzure trace
0
0 comments X

The pith

Two heuristics allocate mixed-scale LLMs on heterogeneous GPUs in under one second while meeting SLOs and approaching optimal cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Deploying LLMs at scale requires choosing base models, GPU types, parallelism settings, and workload splits under tight rules for latency, accuracy, and spending. Exact mixed-integer solvers guarantee the best answer but become too slow as problem size grows. The paper develops a simple Greedy Heuristic for one-pass decisions and an Adaptive Greedy Heuristic that improves it with multiple construction attempts, local moves, and GPU consolidation. Three targeted checks keep every choice inside the memory, delay, error, and budget limits at once. On traces modeled after real Azure LLM workloads, the methods finish quickly, keep costs close to the best known, and hold up when model sizes increase beyond the training data.

Core claim

The Adaptive Greedy Heuristic, built from a basic greedy pass plus multi-start construction, relocate local search, and consolidation, together with TP-aware feasibility selection, cost-per-effective-coverage ranking, and TP upgrade, yields feasible allocations in under one second that closely match MILP optimal cost and preserve controlled SLO violations under 1.5x parameter inflation, while the exact solver degrades sharply.

What carries the argument

The Adaptive Greedy Heuristic (AGH) with multi-start construction, relocate-based local search, GPU consolidation, and the three constraint-aware mechanisms of TP-aware feasibility selection, cost-per-effective-coverage ranking, and TP upgrade.

If this is right

  • Real-time reallocation becomes feasible for continuously arriving inference requests.
  • Operational cost stays near the theoretical minimum without long computation delays.
  • Deployments remain stable when model parameter counts shift after initial planning.
  • Exact solvers can be reserved for small problems while heuristics handle production scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same structure could support online scheduling in cloud systems that adjust GPU pools every few minutes.
  • Similar local-search refinements might transfer to other heterogeneous hardware allocation settings such as mixed CPU-GPU training clusters.
  • Pre-computing a small library of good starting allocations could cut the multi-start overhead even further.

Load-bearing premise

The three constraint-aware mechanisms can always produce feasible allocations when memory, delay, error, and budget constraints are tightly coupled.

What would settle it

A workload instance where the heuristics output an allocation that violates SLOs or budget while the exact MILP solver finds a feasible lower-cost solution, or where either heuristic takes more than one second on the paper's large-scale instances.

Figures

Figures reproduced from arXiv: 2604.07472 by Duong Tung Nguyen, Jiaming Cheng.

Figure 1
Figure 1. Figure 1: System model: users submit queries classified by type, routed to heterogeneous GPU tiers hosting foundation models under TP/PP configurations. et al. [12] study cost-efficient serving with heterogeneous VMs and KV cache offloading, SeaLLM [9] enables multi-LLM resource sharing, and SkyLB [11] proposes locality-aware cross-region load balancing. These works demonstrate the importance of heterogeneity-aware … view at source ↗
Figure 2
Figure 2. Figure 2: (a)–(c): Model comparison under delay/error stress. (d)–(f): AGH sensitivity analysis over SLO thresholds ∆i, ϵi, budget δ, and rental cost p c [PITH_FULL_IMAGE:figures/full_fig_p012_2.png] view at source ↗
read the original abstract

Serving large language model (LLM) inference in cloud environments requires jointly optimizing model selection, GPU provisioning, parallelism configuration, and workload routing under latency, accuracy, memory, and budget constraints. While mixed-integer linear programming (MILP) can model this problem, its computational cost limits frequent re-optimization under demand variability. Existing heuristics often optimize individual components separately and may become infeasible when system-wide constraints are enforced. This paper presents a scalable framework for SLO-constrained LLM inference. We formulate the problem as an MILP with a two-phase delay model capturing both prefill and autoregressive decoding under tensor and pipeline parallelism. To solve it efficiently, we develop two constraint-aware heuristics: a Greedy Heuristic (GH) and an Adaptive Greedy Heuristic (AGH). AGH extends GH through multi-start construction, local search, and GPU consolidation. Both methods maintain feasibility through parallelism-aware filtering, cost-based ranking, and adaptive parallelism scaling. Experiments based on the Azure LLM Inference Trace show that GH generates feasible solutions within one second, while AGH achieves near-optimal performance within three seconds and scales to large instances where exact solvers fail to converge. Under out-of-sample stress with up to 1.5x delay and accuracy inflation, AGH degrades gracefully through provisioned headroom, yielding substantially lower cost and SLO violations than cost-minimal MILP solutions. Across synthetic and real Azure workloads, AGH maintains SLO compliance at significantly lower cost than exact MILP solutions. These results demonstrate that high-quality allocations provide substantial robustness to demand variability while enabling rapid adaptation to workload changes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript proposes two constraint-aware heuristics—a single-pass Greedy Heuristic (GH) and an Adaptive Greedy Heuristic (AGH) that adds multi-start construction, relocate-based local search, and GPU consolidation—for jointly selecting base models, heterogeneous GPUs, parallelism degrees, and workload distributions under coupled memory, latency, accuracy, and budget constraints. On workloads derived from the Azure LLM Inference Trace (2025), both heuristics return feasible solutions in under one second; AGH approaches the cost of an exact MILP solver while delivering >260× speedup on large instances. Under out-of-sample stress tests with up to 1.5× parameter inflation, AGH keeps SLO violations controlled and cost stable, whereas the exact solver degrades.

Significance. If the reported speedups and robustness hold, the work would be significant for practical LLM serving systems: it shows that carefully designed, constraint-aware heuristics can make mixed-integer allocation tractable at scale without sacrificing feasibility or solution quality. The explicit use of a public trace for calibration plus controlled out-of-sample inflation provides a reproducible empirical foundation that is stronger than purely synthetic evaluations.

minor comments (3)
  1. [Abstract] Abstract: the performance claims (sub-second runtimes, 260× speedup, controlled violations) are stated without any reference to the methods section or to the three named mechanisms (TP-aware feasibility selection, cost-per-effective-coverage ranking, TP upgrade); a single sentence summarizing how these mechanisms enforce feasibility would make the abstract self-contained.
  2. [Evaluation] Evaluation section: while the text describes success on the Azure trace and 1.5× inflation tests, no summary table or figure reports the exact cost ratios, violation counts, or runtime distributions across instance sizes; adding such a table would allow readers to assess the “closely approaching optimal” claim quantitatively.
  3. [Methods] The manuscript would benefit from a short pseudocode listing or algorithmic sketch of AGH (multi-start + relocate + consolidation) to complement the prose description of the three constraint-aware mechanisms.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of our work on the GH and AGH heuristics for mixed-scale LLM allocation under SLO constraints, as well as for highlighting the significance of the speedups, feasibility guarantees, and reproducible evaluation on the Azure trace with out-of-sample stress tests. The recommendation for minor revision is noted; we will incorporate any editorial or minor clarifications in the revised version.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents two new constraint-aware heuristics (GH and AGH) for mixed-scale LLM allocation under SLO constraints and evaluates them empirically against an exact MILP solver on the Azure LLM Inference Trace (2025) plus out-of-sample stress tests. The reported results (feasibility in <1s, 260x speedup, cost proximity, controlled SLO violations) are direct measurements of algorithm runtime and solution quality on external trace data; no equations reduce these outcomes to fitted parameters defined by the same data, no self-citations bear the central claim, and no ansatz or uniqueness theorem is invoked to force the result. The derivation chain consists of algorithmic construction plus benchmark comparison and is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the representativeness of the Azure trace for calibration and the assumption that the three named mechanisms suffice for feasibility; no free parameters, new mathematical axioms, or invented entities are introduced.

axioms (1)
  • domain assumption The Azure LLM Inference Trace (2025) provides representative workloads for evaluating allocation heuristics under realistic request patterns.
    Workloads are calibrated with this trace; the claim of controlled SLO violations under stress tests depends on it being a fair proxy for production.

pith-pipeline@v0.9.0 · 5490 in / 1456 out tokens · 55947 ms · 2026-05-10T17:57:20.385200+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.