Throughput-Optimal Multiresource-Job Scheduling with Continuous Requirement Distribution

Heyuan Yao; Izzy Grosof; Willow Kowalik

arxiv: 2605.21715 · v1 · pith:WEXLN2WTnew · submitted 2026-05-20 · 💻 cs.PF

Throughput-Optimal Multiresource-Job Scheduling with Continuous Requirement Distribution

Heyuan Yao , Willow Kowalik , Izzy Grosof This is my paper

Pith reviewed 2026-05-22 08:30 UTC · model grok-4.3

classification 💻 cs.PF

keywords multiresource job schedulingthroughput optimalitycontinuous distributiondiscretizationpreemptive schedulingnonpreemptive schedulingqueueing modelsdatacenter scheduling

0 comments

The pith

Throughput-optimal scheduling policies exist for multiresource jobs with continuously distributed resource requirements.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the first family of scheduling policies proven to achieve throughput optimality in the continuous multiresource job model, covering both preemptive and nonpreemptive cases. Practical systems rarely repeat exact resource values such as CPU and memory, so continuous distributions match reality far better than the small number of classes assumed in prior theory. The approach works by discretizing the continuous requirements into a finite set of classes, with the bin width selected from the system load and the underlying distribution to keep the optimality guarantee intact. Several faster-to-compute variants are also given that stay optimal under mild extra assumptions on the distribution. The policies match or exceed existing index-based methods on both synthetic continuous distributions and real Google Borg scheduler traces.

Core claim

We introduce the first throughput-optimal family of scheduling policies for the continuous MRJ model, with both preemptive and nonpreemptive variants. We use a discretization approach, where we choose the discretization granularity based on the system load and the distribution of resource requirements. We further introduce several efficient policy families, which remain throughput-optimal while considerably improving computational efficiency, under some distributional assumptions.

What carries the argument

Discretization of the continuous resource-requirement distribution into a finite number of classes, with granularity chosen from system load and the requirement distribution to preserve throughput optimality.

If this is right

Throughput optimality holds for both preemptive and non-preemptive policies in the continuous MRJ model.
More computationally efficient policy families remain throughput-optimal when the requirement distribution satisfies additional assumptions.
The policies achieve state-of-the-art performance on parametrized continuous distributions and on Google Borg datacenter traces.
The discretization technique directly bridges class-based theoretical models to the continuous distributions observed in production systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same discretization idea could be tested on other resource-allocation problems that currently assume only a few discrete classes.
Adaptive versions that adjust bin width on the fly from observed load might further reduce overhead while keeping optimality.
The approach suggests that many existing index policies could be made optimal for continuous settings by applying a similar load-dependent discretization step.

Load-bearing premise

The chosen discretization granularity is fine enough to preserve throughput optimality when the underlying model is truly continuous rather than already discrete.

What would settle it

A workload whose resource requirements follow a continuous distribution in which the proposed policy's long-run throughput falls measurably below the maximum feasible rate that would be achieved by an ideal fluid scheduler.

read the original abstract

Modern computing systems process jobs with resource requirements such as CPU and memory, which are described by multiresource jobs (MRJ) queueing models. In practice, job resource requirements are spread out over so many values, that it is rare to see the same value twice. This pattern is best modeled by a continuous distribution of requirement values. However, the existing theoretical work on stability or throughput-optimality focuses on queueing models with class-based resource requirements. In class-based models, the number of distinct resource requirements must be small to demonstrate strong empirical performance, making them a poor match for these practical systems. We introduce the first throughput-optimal family of scheduling policies for the continuous MRJ model, with both preemptive and nonpreemptive variants. We further introduce several efficient policy families, which remain throughput-optimal while considerably improving computational efficiency, under some distributional assumptions. We use a discretization approach, where we choose the discretization granularity based on the system load and the distribution of resource requirements. We validate the real-world applicability of our policies by comparing them against existing index-based policies on parametrized distributions and on datacenter trace data from the Google Borg scheduler, demonstrating state-of-the-art performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The discretization method offers a practical way to handle continuous resource requirements in MRJ scheduling and claims the first throughput-optimal policies for that setting, but the stability argument for the continuous case needs tighter error bounds to be fully convincing.

read the letter

The main point is that this paper gives the first throughput-optimal scheduling policies for multiresource jobs where requirements follow a continuous distribution rather than a small set of discrete classes. They do this with a discretization step whose granularity depends on the load and the requirement distribution, and they supply both preemptive and non-preemptive versions plus some faster variants under extra assumptions on the distribution. That directly addresses the gap between theory and the kind of workloads seen in real datacenters, where identical requirements are rare. The trace validation on Google Borg data and the comparison to index policies is a useful check on whether the approach translates outside synthetic cases. They also show the policies remain efficient enough to be usable. The soft spot is the stability analysis. The construction maps continuous jobs to discrete classes and then applies an optimal policy on the discretized system, but it is not obvious from the abstract whether the fluid-limit or Lyapunov drift stays negative outside the capacity region once the mapping error is included. If the proof only controls the discretized system and then appeals to continuity without explicit uniform bounds that vanish for arbitrary distributions, the exact optimality claim for the original continuous model could be weaker than stated. The stress-test concern about residual positive drift from the mapping looks like the part that needs the closest look in the full derivation. This work is for people who build or analyze scheduling policies for heterogeneous resources in performance modeling and datacenter systems. A reader who cares about closing the gap between class-based theory and continuous practice would get concrete value from the construction and the empirical checks. It is coherent on its own terms and shows honest engagement with the modeling issue, so it deserves a serious referee to verify the proof details and the error control. I would send it out for review rather than desk-reject.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the first throughput-optimal family of scheduling policies for the continuous multiresource-job (MRJ) model, using a discretization approach whose granularity is selected from the system load and the resource-requirement distribution. Both preemptive and nonpreemptive variants are presented, together with computationally efficient sub-families that remain optimal under additional distributional assumptions. The policies are validated against index-based baselines on parametrized distributions and on Google Borg trace data.

Significance. If the stability argument extends rigorously to arbitrary continuous distributions, the result would close an important gap between existing class-based throughput-optimality theory and the continuous requirement distributions observed in production systems. The load- and distribution-dependent discretization rule, the provision of both preemptive and nonpreemptive policies, and the empirical evaluation on real datacenter traces are concrete strengths that would make the work influential for both theory and practice.

major comments (2)

[Section on policy construction and stability analysis] Section on policy construction and stability analysis: the central claim that a policy optimal on the finite discretization remains throughput-optimal for the original continuous model requires explicit uniform error bounds showing that the mapping-induced residual drift vanishes in the fluid limit. The current argument appears to establish stability only for the discretized system and then invokes a continuity step; without these bounds the exact optimality result for arbitrary continuous distributions is not yet established.
[Theorem or proposition establishing throughput optimality] Theorem or proposition establishing throughput optimality: the proof must demonstrate that the chosen discretization granularity (derived from load and the requirement distribution) preserves the capacity region boundary for every distribution in the considered class; a concrete counter-example or a uniform convergence argument is needed to rule out distributions for which the mapping error leaves a positive drift outside the region.

minor comments (2)

[Policy construction] Clarify in the policy-construction section exactly how the discretization points are chosen as a function of the load vector and the cumulative distribution function; the current description leaves the selection rule implicit.
[Trace validation] In the trace-validation experiments, report the number of independent runs or bootstrap confidence intervals for the throughput and delay metrics so that the claimed state-of-the-art performance can be assessed statistically.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments correctly identify that the current stability argument would benefit from more explicit bounds to rigorously close the gap between the discretized and continuous models. We address each point below and will revise the manuscript to incorporate the requested strengthening of the proofs.

read point-by-point responses

Referee: Section on policy construction and stability analysis: the central claim that a policy optimal on the finite discretization remains throughput-optimal for the original continuous model requires explicit uniform error bounds showing that the mapping-induced residual drift vanishes in the fluid limit. The current argument appears to establish stability only for the discretized system and then invokes a continuity step; without these bounds the exact optimality result for arbitrary continuous distributions is not yet established.

Authors: We agree that the continuity step requires additional justification to be fully rigorous. The existing argument establishes fluid stability for the finite-class discretized system via standard Lyapunov-drift techniques and then appeals to continuity of the mapping. In the revision we will insert a new lemma that derives explicit uniform error bounds on the residual drift term induced by the discretization. These bounds will be expressed in terms of the chosen granularity (itself a function of load and the requirement distribution) and will show that the error is o(1) under fluid scaling, ensuring the drift remains strictly negative outside the capacity region for the original continuous model. This addition will confirm throughput optimality for arbitrary continuous distributions satisfying the paper's moment assumptions. revision: yes
Referee: Theorem or proposition establishing throughput optimality: the proof must demonstrate that the chosen discretization granularity (derived from load and the requirement distribution) preserves the capacity region boundary for every distribution in the considered class; a concrete counter-example or a uniform convergence argument is needed to rule out distributions for which the mapping error leaves a positive drift outside the region.

Authors: We acknowledge the need for an explicit demonstration that the capacity-region boundary is preserved. The granularity rule is constructed precisely so that the discretization error in the service-rate vectors is controlled uniformly by the load and the variation of the requirement distribution. In the revised version we will add a uniform convergence argument (based on the continuity of the expectation operator under the chosen partition) showing that the approximated capacity region converges to the true region at a rate sufficient to keep any residual drift negative outside the boundary. This argument applies uniformly over the class of continuous distributions with finite second moments, thereby ruling out distributions that could produce a persistent positive drift. Should the referee have a specific counter-example distribution in mind, we would be grateful for the details so that we may verify the bound directly. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in the derivation chain.

full rationale

The paper derives throughput-optimal policies for the continuous MRJ model via a discretization approach that selects granularity from system load and the resource-requirement distribution. The central claim of optimality for the continuous model is supported by stability analysis that maps the continuous system to a discrete one while preserving the capacity region boundary. No quoted equations or self-citations reduce the optimality result to a fitted parameter, self-definition, or prior author work by construction. The discretization rule is presented as derived from load and distributional properties rather than assumed, and the proof chain remains independent of the target result. This is the expected self-contained case for a paper introducing a new policy family with explicit construction and analysis steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the modeling choice that resource requirements are drawn from a continuous distribution and on the claim that a load-dependent discretization preserves throughput optimality.

axioms (1)

domain assumption Resource requirements follow a continuous distribution that can be discretized at a granularity chosen from system load and the distribution itself.
Invoked to justify moving from class-based to continuous MRJ models and to select discretization.

pith-pipeline@v0.9.0 · 5743 in / 1161 out tokens · 55365 ms · 2026-05-22T08:30:26.295130+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages

[1]

Operations research 28(6), 1335–1346 (1980)

Green, L.: A queueing system in which customers require a random number of servers. Operations research 28(6), 1335–1346 (1980)

work page 1980
[2]

Mana gement Science 30(1), 51–68 (1984)

Brill, P.H., Green, L.: Queues in which customers receive simultaneou s service from a random number of servers: a system point approach. Mana gement Science 30(1), 51–68 (1984)

work page 1984
[3]

IEEE Transactions on Information theory 39(2), 466–478 (1993)

Tassiulas, L., Ephremides, A.: Dynamic server allocation to parallel queues with randomly varying connectivity. IEEE Transactions on Information theory 39(2), 466–478 (1993)

work page 1993
[4]

Algorithmica 29(1), 70–88 (2001) 45

Coﬀman, E., Stolyar, A.L.: Bandwidth packing. Algorithmica 29(1), 70–88 (2001) 45

work page 2001
[5]

The Annals of Applied Probab ility 14(1), 1–53 (2004)

Stolyar, A.L.: Maxweight scheduling in a generalized switch: State s pace collapse and workload minimization in heavy traﬃc. The Annals of Applied Probab ility 14(1), 1–53 (2004)

work page 2004
[6]

In: 2012 Proceedings IEEE I nfocom, pp

Maguluri, S.T., Srikant, R., Ying, L.: Stochastic models of load balanc ing and scheduling in cloud computing clusters. In: 2012 Proceedings IEEE I nfocom, pp. 702–710 (2012). IEEE

work page 2012
[7]

Annals of Operations Research 252(1), 29–39 (2017)

Rumyantsev, A., Morozov, E.: Stability criterion of a multiserver m odel with simultaneous service. Annals of Operations Research 252(1), 29–39 (2017)

work page 2017
[8]

IEEE/ACM Transactions on Networking 26(5), 2202–2215 (2018)

Psychas, K., Ghaderi, J.: Randomized algorithms for scheduling mu lti-resource jobs in the cloud. IEEE/ACM Transactions on Networking 26(5), 2202–2215 (2018)

work page 2018
[9]

Queueing Systems 102(1), 143–174 (2022)

Grosof, I., Harchol-Balter, M., Scheller-Wolf, A.: WCFS: A new fra mework for analyzing multiserver systems. Queueing Systems 102(1), 143–174 (2022)

work page 2022
[10]

Proceedings of t he ACM on Measurement and Analysis of Computing Systems 6(3), 1–32 (2022)

Grosof, I., Scully, Z., Harchol-Balter, M., Scheller-Wolf, A.: Optim al scheduling in the multiserver-job model under heavy traﬃc. Proceedings of t he ACM on Measurement and Analysis of Computing Systems 6(3), 1–32 (2022)

work page 2022
[11]

Perf ormance Evaluation 162, 102378 (2023)

Grosof, I., Hong, Y., Harchol-Balter, M., Scheller-Wolf, A.: The R ESET and MARC techniques, with application to multiserver-job analysis. Perf ormance Evaluation 162, 102378 (2023)

work page 2023
[12]

In: Pro- ceedings of the Twenty-Third International Symposium on Theory , Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Co mputing, pp

Hong, Y., Wang, W.: Sharp waiting-time bounds for multiserver jo bs. In: Pro- ceedings of the Twenty-Third International Symposium on Theory , Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Co mputing, pp. 161–170 (2022)

work page 2022
[13]

Performance Evaluation 171, 102525 (2026) https://doi.org/10.1016/j.peva.2025.102525

Chen, Z., Anggraito, A., Olliaro, D., Marin, A., Marsan, M.A., Berg, B., Grosof, I.: Improving nonpreemptive multiserver job schedul- ing with quickswap. Performance Evaluation 171, 102525 (2026) https://doi.org/10.1016/j.peva.2025.102525

work page doi:10.1016/j.peva.2025.102525 2026
[14]

Pro- ceedings of the ACM on Measurement and Analysis of Computing Syst ems 1(2), 1–29 (2017)

Psychas, K., Ghaderi, J.: On non-preemptive VM scheduling in the cloud. Pro- ceedings of the ACM on Measurement and Analysis of Computing Syst ems 1(2), 1–29 (2017)

work page 2017
[15]

Proceedings of the ACM on Measurement a nd Analysis of Computing Systems 9(2), 1–36 (2025)

Chen, Z., Grosof, I., Berg, B.: Improving multiresource job sch eduling with marko- vian service rate policies. Proceedings of the ACM on Measurement a nd Analysis of Computing Systems 9(2), 1–36 (2025)

work page 2025
[16]

In: Proceedings o f the Fifteenth European Conference on Computer Systems, pp

Tirmazi, M., Barker, A., Deng, N., Haque, M.E., Qin, Z.G., Hand, S., Ha rchol- Balter, M., Wilkes, J.: Borg: the next generation. In: Proceedings o f the Fifteenth European Conference on Computer Systems, pp. 1–14 (2020) 46

work page 2020
[17]

In: 2024 ACM/IEEE 51st Annual International Symposium on Computer Arc hitecture (ISCA), pp

Patel, P., Choukse, E., Zhang, C., Shah, A., Goiri, ´I., Maleki, S., Bianchini, R.: Splitwise: Eﬃcient generative LLM inference using phase splitting. In: 2024 ACM/IEEE 51st Annual International Symposium on Computer Arc hitecture (ISCA), pp. 118–132 (2024). IEEE

work page 2024
[18]

In: Procee dings of the Third ACM Symposium on Cloud Computing, pp

Reiss, C., Tumanov, A., Ganger, G.R., Katz, R.H., Kozuch, M.A.: Het erogeneity and dynamicity of clouds at scale: Google trace analysis. In: Procee dings of the Third ACM Symposium on Cloud Computing, pp. 1–13 (2012)

work page 2012
[19]

In: Pro- ceedings of the European Conference on Computer Systems (Eur oSys’15), Bordeaux, France (2015)

Verma, A., Pedrosa, L., Korupolu, M.R., Oppenheimer, D., Tune, E ., Wilkes, J.: Large-scale cluster management at Google with Borg. In: Pro- ceedings of the European Conference on Computer Systems (Eur oSys’15), Bordeaux, France (2015). https://doi.org/10.1145/2741948.2741964 . https://dl.acm.org/doi/10.1145/2741948.2741964

work page doi:10.1145/2741948.2741964 2015
[20]

Google research blog, Mountain View, CA, USA

Wilkes, J.: Yet more Google compute cluster trace data. Google research blog, Mountain View, CA, USA. Posted at https://ai.googleblog.com/2020/04/yet-more-google-compute- cluster-trace.html. (2020)

work page 2020
[21]

Technical report, Google Inc., Mountain View, CA, USA (April 2020)

Wilkes, J.: Google cluster-usage traces v3. Technical report, Google Inc., Mountain View, CA, USA (April 2020). Posted at https://github.com/google/cluster-data/blob/master/Cluster Data2019.md

work page 2020
[22]

Google research blog, Mountain View, CA, USA

Wilkes, J.: More Google cluster data. Google research blog, Mountain View, CA, USA. Posted at http://googleresearch.blogspot.com/2011/11/more-google-clu ster-data.html. (2011)

work page 2011
[23]

Technical report, Google Inc., Mountain View, CA, USA (November 2011)

Reiss, C., Wilkes, J., Hellerstein, J.L.: Google cluster-usage trace s: for- mat + schema. Technical report, Google Inc., Mountain View, CA, USA (November 2011). Revised 2014-11-17 for version 2.1. Poste d at https://github.com/google/cluster-data

work page 2011
[24]

In: Proceedings o f the 8th ACM European Conference on Computer Systems, pp

Schwarzkopf, M., Konwinski, A., Abd-El-Malek, M., Wilkes, J.: Omeg a: ﬂexible, scalable schedulers for large compute clusters. In: Proceedings o f the 8th ACM European Conference on Computer Systems, pp. 351–364 (2013 )

work page 2013
[25]

In: Proceedings of the International Symposium on Quality of Serv ice, pp

Guo, J., Chang, Z., Wang, S., Ding, H., Feng, Y., Mao, L., Bao, Y.: Wh o limits the resource eﬃciency of my datacenter: An analysis of Alibaba dat acenter traces. In: Proceedings of the International Symposium on Quality of Serv ice, pp. 1–10 (2019)

work page 2019
[26]

In: Proceedings of the 4th International Con ference on Big Data and Computing, pp

Li, F., Hu, B.: Deepjs: Job scheduling based on deep reinforceme nt learning in cloud data center. In: Proceedings of the 4th International Con ference on Big Data and Computing, pp. 48–53 (2019) 47

work page 2019
[27]

In: 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp

Wu, H., Zhang, W., Xu, Y., Xiang, H., Huang, T., Ding, H., Zhang, Z.: A laddin: Optimized maximum ﬂow management for shared production clusters . In: 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 696–707 (2019). IEEE

work page 2019
[28]

arXiv preprint arXiv:2509.08207 (2025)

Allen, B.S., Anchell, J., Anisimov, V., Applencourt, T., Bagusetty, A ., Balakr- ishnan, R., Balin, R., Bekele, S., Bertoni, C., Blackworth, C., et al.: Auro ra: Architecting argonne’s ﬁrst exascale supercomputer for acceler ated scientiﬁc discovery. arXiv preprint arXiv:2509.08207 (2025)

work page arXiv 2025
[29]

In: In ternational Conference on High Performance Computing, pp

Budiardja, R.D., Berrill, M., Eisenbach, M., Jansen, G.R., Joubert, W., Nichols, S., Rogers, D.M., Tharrington, A., Bronson Messer, O.: Ready for th e frontier: Preparing applications for the world’s ﬁrst exascale system. In: In ternational Conference on High Performance Computing, pp. 182–201 (2023) . Springer

work page 2023
[30]

: Fron- tier: exploring exascale

Atchley, S., Zimmer, C., Lange, J., Bernholdt, D., Melesse Vergar a, V., Beck, T., Brim, M., Budiardja, R., Chandrasekaran, S., Eisenbach, M., et al. : Fron- tier: exploring exascale. In: Proceedings of the International Co nference for High Performance Computing, Networking, Storage and Analysis, pp. 1 –16 (2023)

work page 2023
[31]

In: Proc eedings of the Fifteenth European Conference on Computer Systems (Eu roSys’20)

Tirmazi, M., Barker, A., Deng, N., Haque, M.E., Qin, Z.G., Hand, S., Harchol-Balter, M., Wilkes, J.: Borg: the Next Generation. In: Proc eedings of the Fifteenth European Conference on Computer Systems (Eu roSys’20). ACM, Heraklion, Greece (2020). https://doi.org/10.1145/3342195.3387517 . https://doi.org/10.1145/3342195.3387517

work page doi:10.1145/3342195.3387517 2020
[32]

Proceedings of the London Mathematical Society 2(1), 75–115 (1918)

Hardy, G.H., Ramanujan, S.: Asymptotic formulae in combinatory analysis. Proceedings of the London Mathematical Society 2(1), 75–115 (1918)

work page 1918
[33]

Experimental Mathematics 32(3), 457–466 (2023)

Shlyk, V.A.: Number of vertices of the polytope of integer partit ions and fac- torization of the partitioned number. Experimental Mathematics 32(3), 457–466 (2023)

work page 2023
[34]

Advances in Applied Proba bility 25(3), 518–548 (1993)

Meyn, S.P., Tweedie, R.L.: Stability of markovian processes III: F oster–Lyapunov criteria for continuous-time processes. Advances in Applied Proba bility 25(3), 518–548 (1993)

work page 1993
[35]

In: Proceedings of the Workshop on St ochastic Stability and Stochastic Stabilization, Metz, France (1993)

Meyn, S.P., Tweedie, R.: A survey of Foster-Lyapunov techniqu es for general state space markov processes. In: Proceedings of the Workshop on St ochastic Stability and Stochastic Stabilization, Metz, France (1993)

work page 1993
[36]

In: Markov Processes and Related Topics: a Festschrift for Thom as G

Glynn, P.W., Zeevi, A.: Bounding stationary expectations of mark ov processes. In: Markov Processes and Related Topics: a Festschrift for Thom as G. Kurtz vol. 4, pp. 195–215. Institute of Mathematical Statistics, ??? (2008 )

work page 2008
[37]

In: 2024 IEEE 25th International Confe rence on High 48 Performance Switching and Routing (HPSR), pp

Yildiz, M., Baiocchi, A.: Data-driven workload generation based on google data center measurements. In: 2024 IEEE 25th International Confe rence on High 48 Performance Switching and Routing (HPSR), pp. 143–148 (2024). IEEE Appendix A List of Notation General Setting: Symbol Description λ Arrival rate d The number of resources in the MRJ setting V Resource...

work page 2024

[1] [1]

Operations research 28(6), 1335–1346 (1980)

Green, L.: A queueing system in which customers require a random number of servers. Operations research 28(6), 1335–1346 (1980)

work page 1980

[2] [2]

Mana gement Science 30(1), 51–68 (1984)

Brill, P.H., Green, L.: Queues in which customers receive simultaneou s service from a random number of servers: a system point approach. Mana gement Science 30(1), 51–68 (1984)

work page 1984

[3] [3]

IEEE Transactions on Information theory 39(2), 466–478 (1993)

Tassiulas, L., Ephremides, A.: Dynamic server allocation to parallel queues with randomly varying connectivity. IEEE Transactions on Information theory 39(2), 466–478 (1993)

work page 1993

[4] [4]

Algorithmica 29(1), 70–88 (2001) 45

Coﬀman, E., Stolyar, A.L.: Bandwidth packing. Algorithmica 29(1), 70–88 (2001) 45

work page 2001

[5] [5]

The Annals of Applied Probab ility 14(1), 1–53 (2004)

Stolyar, A.L.: Maxweight scheduling in a generalized switch: State s pace collapse and workload minimization in heavy traﬃc. The Annals of Applied Probab ility 14(1), 1–53 (2004)

work page 2004

[6] [6]

In: 2012 Proceedings IEEE I nfocom, pp

Maguluri, S.T., Srikant, R., Ying, L.: Stochastic models of load balanc ing and scheduling in cloud computing clusters. In: 2012 Proceedings IEEE I nfocom, pp. 702–710 (2012). IEEE

work page 2012

[7] [7]

Annals of Operations Research 252(1), 29–39 (2017)

Rumyantsev, A., Morozov, E.: Stability criterion of a multiserver m odel with simultaneous service. Annals of Operations Research 252(1), 29–39 (2017)

work page 2017

[8] [8]

IEEE/ACM Transactions on Networking 26(5), 2202–2215 (2018)

Psychas, K., Ghaderi, J.: Randomized algorithms for scheduling mu lti-resource jobs in the cloud. IEEE/ACM Transactions on Networking 26(5), 2202–2215 (2018)

work page 2018

[9] [9]

Queueing Systems 102(1), 143–174 (2022)

Grosof, I., Harchol-Balter, M., Scheller-Wolf, A.: WCFS: A new fra mework for analyzing multiserver systems. Queueing Systems 102(1), 143–174 (2022)

work page 2022

[10] [10]

Proceedings of t he ACM on Measurement and Analysis of Computing Systems 6(3), 1–32 (2022)

Grosof, I., Scully, Z., Harchol-Balter, M., Scheller-Wolf, A.: Optim al scheduling in the multiserver-job model under heavy traﬃc. Proceedings of t he ACM on Measurement and Analysis of Computing Systems 6(3), 1–32 (2022)

work page 2022

[11] [11]

Perf ormance Evaluation 162, 102378 (2023)

Grosof, I., Hong, Y., Harchol-Balter, M., Scheller-Wolf, A.: The R ESET and MARC techniques, with application to multiserver-job analysis. Perf ormance Evaluation 162, 102378 (2023)

work page 2023

[12] [12]

In: Pro- ceedings of the Twenty-Third International Symposium on Theory , Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Co mputing, pp

Hong, Y., Wang, W.: Sharp waiting-time bounds for multiserver jo bs. In: Pro- ceedings of the Twenty-Third International Symposium on Theory , Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Co mputing, pp. 161–170 (2022)

work page 2022

[13] [13]

Performance Evaluation 171, 102525 (2026) https://doi.org/10.1016/j.peva.2025.102525

Chen, Z., Anggraito, A., Olliaro, D., Marin, A., Marsan, M.A., Berg, B., Grosof, I.: Improving nonpreemptive multiserver job schedul- ing with quickswap. Performance Evaluation 171, 102525 (2026) https://doi.org/10.1016/j.peva.2025.102525

work page doi:10.1016/j.peva.2025.102525 2026

[14] [14]

Pro- ceedings of the ACM on Measurement and Analysis of Computing Syst ems 1(2), 1–29 (2017)

Psychas, K., Ghaderi, J.: On non-preemptive VM scheduling in the cloud. Pro- ceedings of the ACM on Measurement and Analysis of Computing Syst ems 1(2), 1–29 (2017)

work page 2017

[15] [15]

Proceedings of the ACM on Measurement a nd Analysis of Computing Systems 9(2), 1–36 (2025)

Chen, Z., Grosof, I., Berg, B.: Improving multiresource job sch eduling with marko- vian service rate policies. Proceedings of the ACM on Measurement a nd Analysis of Computing Systems 9(2), 1–36 (2025)

work page 2025

[16] [16]

In: Proceedings o f the Fifteenth European Conference on Computer Systems, pp

Tirmazi, M., Barker, A., Deng, N., Haque, M.E., Qin, Z.G., Hand, S., Ha rchol- Balter, M., Wilkes, J.: Borg: the next generation. In: Proceedings o f the Fifteenth European Conference on Computer Systems, pp. 1–14 (2020) 46

work page 2020

[17] [17]

In: 2024 ACM/IEEE 51st Annual International Symposium on Computer Arc hitecture (ISCA), pp

Patel, P., Choukse, E., Zhang, C., Shah, A., Goiri, ´I., Maleki, S., Bianchini, R.: Splitwise: Eﬃcient generative LLM inference using phase splitting. In: 2024 ACM/IEEE 51st Annual International Symposium on Computer Arc hitecture (ISCA), pp. 118–132 (2024). IEEE

work page 2024

[18] [18]

In: Procee dings of the Third ACM Symposium on Cloud Computing, pp

Reiss, C., Tumanov, A., Ganger, G.R., Katz, R.H., Kozuch, M.A.: Het erogeneity and dynamicity of clouds at scale: Google trace analysis. In: Procee dings of the Third ACM Symposium on Cloud Computing, pp. 1–13 (2012)

work page 2012

[19] [19]

In: Pro- ceedings of the European Conference on Computer Systems (Eur oSys’15), Bordeaux, France (2015)

Verma, A., Pedrosa, L., Korupolu, M.R., Oppenheimer, D., Tune, E ., Wilkes, J.: Large-scale cluster management at Google with Borg. In: Pro- ceedings of the European Conference on Computer Systems (Eur oSys’15), Bordeaux, France (2015). https://doi.org/10.1145/2741948.2741964 . https://dl.acm.org/doi/10.1145/2741948.2741964

work page doi:10.1145/2741948.2741964 2015

[20] [20]

Google research blog, Mountain View, CA, USA

Wilkes, J.: Yet more Google compute cluster trace data. Google research blog, Mountain View, CA, USA. Posted at https://ai.googleblog.com/2020/04/yet-more-google-compute- cluster-trace.html. (2020)

work page 2020

[21] [21]

Technical report, Google Inc., Mountain View, CA, USA (April 2020)

Wilkes, J.: Google cluster-usage traces v3. Technical report, Google Inc., Mountain View, CA, USA (April 2020). Posted at https://github.com/google/cluster-data/blob/master/Cluster Data2019.md

work page 2020

[22] [22]

Google research blog, Mountain View, CA, USA

Wilkes, J.: More Google cluster data. Google research blog, Mountain View, CA, USA. Posted at http://googleresearch.blogspot.com/2011/11/more-google-clu ster-data.html. (2011)

work page 2011

[23] [23]

Technical report, Google Inc., Mountain View, CA, USA (November 2011)

Reiss, C., Wilkes, J., Hellerstein, J.L.: Google cluster-usage trace s: for- mat + schema. Technical report, Google Inc., Mountain View, CA, USA (November 2011). Revised 2014-11-17 for version 2.1. Poste d at https://github.com/google/cluster-data

work page 2011

[24] [24]

In: Proceedings o f the 8th ACM European Conference on Computer Systems, pp

Schwarzkopf, M., Konwinski, A., Abd-El-Malek, M., Wilkes, J.: Omeg a: ﬂexible, scalable schedulers for large compute clusters. In: Proceedings o f the 8th ACM European Conference on Computer Systems, pp. 351–364 (2013 )

work page 2013

[25] [25]

In: Proceedings of the International Symposium on Quality of Serv ice, pp

Guo, J., Chang, Z., Wang, S., Ding, H., Feng, Y., Mao, L., Bao, Y.: Wh o limits the resource eﬃciency of my datacenter: An analysis of Alibaba dat acenter traces. In: Proceedings of the International Symposium on Quality of Serv ice, pp. 1–10 (2019)

work page 2019

[26] [26]

In: Proceedings of the 4th International Con ference on Big Data and Computing, pp

Li, F., Hu, B.: Deepjs: Job scheduling based on deep reinforceme nt learning in cloud data center. In: Proceedings of the 4th International Con ference on Big Data and Computing, pp. 48–53 (2019) 47

work page 2019

[27] [27]

In: 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp

Wu, H., Zhang, W., Xu, Y., Xiang, H., Huang, T., Ding, H., Zhang, Z.: A laddin: Optimized maximum ﬂow management for shared production clusters . In: 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 696–707 (2019). IEEE

work page 2019

[28] [28]

arXiv preprint arXiv:2509.08207 (2025)

Allen, B.S., Anchell, J., Anisimov, V., Applencourt, T., Bagusetty, A ., Balakr- ishnan, R., Balin, R., Bekele, S., Bertoni, C., Blackworth, C., et al.: Auro ra: Architecting argonne’s ﬁrst exascale supercomputer for acceler ated scientiﬁc discovery. arXiv preprint arXiv:2509.08207 (2025)

work page arXiv 2025

[29] [29]

In: In ternational Conference on High Performance Computing, pp

Budiardja, R.D., Berrill, M., Eisenbach, M., Jansen, G.R., Joubert, W., Nichols, S., Rogers, D.M., Tharrington, A., Bronson Messer, O.: Ready for th e frontier: Preparing applications for the world’s ﬁrst exascale system. In: In ternational Conference on High Performance Computing, pp. 182–201 (2023) . Springer

work page 2023

[30] [30]

: Fron- tier: exploring exascale

Atchley, S., Zimmer, C., Lange, J., Bernholdt, D., Melesse Vergar a, V., Beck, T., Brim, M., Budiardja, R., Chandrasekaran, S., Eisenbach, M., et al. : Fron- tier: exploring exascale. In: Proceedings of the International Co nference for High Performance Computing, Networking, Storage and Analysis, pp. 1 –16 (2023)

work page 2023

[31] [31]

In: Proc eedings of the Fifteenth European Conference on Computer Systems (Eu roSys’20)

Tirmazi, M., Barker, A., Deng, N., Haque, M.E., Qin, Z.G., Hand, S., Harchol-Balter, M., Wilkes, J.: Borg: the Next Generation. In: Proc eedings of the Fifteenth European Conference on Computer Systems (Eu roSys’20). ACM, Heraklion, Greece (2020). https://doi.org/10.1145/3342195.3387517 . https://doi.org/10.1145/3342195.3387517

work page doi:10.1145/3342195.3387517 2020

[32] [32]

Proceedings of the London Mathematical Society 2(1), 75–115 (1918)

Hardy, G.H., Ramanujan, S.: Asymptotic formulae in combinatory analysis. Proceedings of the London Mathematical Society 2(1), 75–115 (1918)

work page 1918

[33] [33]

Experimental Mathematics 32(3), 457–466 (2023)

Shlyk, V.A.: Number of vertices of the polytope of integer partit ions and fac- torization of the partitioned number. Experimental Mathematics 32(3), 457–466 (2023)

work page 2023

[34] [34]

Advances in Applied Proba bility 25(3), 518–548 (1993)

Meyn, S.P., Tweedie, R.L.: Stability of markovian processes III: F oster–Lyapunov criteria for continuous-time processes. Advances in Applied Proba bility 25(3), 518–548 (1993)

work page 1993

[35] [35]

In: Proceedings of the Workshop on St ochastic Stability and Stochastic Stabilization, Metz, France (1993)

Meyn, S.P., Tweedie, R.: A survey of Foster-Lyapunov techniqu es for general state space markov processes. In: Proceedings of the Workshop on St ochastic Stability and Stochastic Stabilization, Metz, France (1993)

work page 1993

[36] [36]

In: Markov Processes and Related Topics: a Festschrift for Thom as G

Glynn, P.W., Zeevi, A.: Bounding stationary expectations of mark ov processes. In: Markov Processes and Related Topics: a Festschrift for Thom as G. Kurtz vol. 4, pp. 195–215. Institute of Mathematical Statistics, ??? (2008 )

work page 2008

[37] [37]

In: 2024 IEEE 25th International Confe rence on High 48 Performance Switching and Routing (HPSR), pp

Yildiz, M., Baiocchi, A.: Data-driven workload generation based on google data center measurements. In: 2024 IEEE 25th International Confe rence on High 48 Performance Switching and Routing (HPSR), pp. 143–148 (2024). IEEE Appendix A List of Notation General Setting: Symbol Description λ Arrival rate d The number of resources in the MRJ setting V Resource...

work page 2024