"Two-Stagification": Job Dispatching in Large-Scale Clusters via a Two-Stage Architecture

Alexey Rolich; Andrea Baiocchi; Mert Yildiz

arxiv: 2505.03032 · v1 · submitted 2025-05-05 · 💻 cs.DC · cs.PF

"Two-Stagification": Job Dispatching in Large-Scale Clusters via a Two-Stage Architecture

Mert Yildiz , Alexey Rolich , Andrea Baiocchi This is my paper

Pith reviewed 2026-05-22 16:11 UTC · model grok-4.3

classification 💻 cs.DC cs.PF

keywords job dispatchingtwo-stage architectureservice-time thresholdlarge-scale clustersmean response timeRound RobinJoin Idle QueueLeast Work Left

0 comments

The pith

Two-stage architecture with service-time threshold improves dispatching performance in large clusters

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tries to establish that a two-stage architecture for job dispatching, which applies classical policies after separating large jobs via an optimized service-time threshold, greatly improves mean response times in large clusters. A sympathetic reader would care because it shows how architectural design can deliver performance close to advanced methods without needing full job size or server state information at dispatch time. The demonstration uses synthetic Weibull workloads and real Google data center traces. This matters for large-scale computing environments where reducing mean response times is key to efficiency.

Core claim

We introduce a two-stage cluster architecture that applies classical Round Robin, Join Idle Queue, and Least Work Left dispatching schemes, coupled with an optimized service-time threshold to separate large jobs from shorter ones. Using both synthetic (Weibull) workloads and real Google data center traces, we demonstrate that our two-stage approach greatly improves upon the corresponding single-stage policies and closely approaches the performance of advanced size- and state-aware methods.

What carries the argument

The two-stage architecture coupled with an optimized service-time threshold to separate large jobs from shorter ones.

If this is right

Classical dispatching schemes perform significantly better when large jobs are handled in a separate stage.
The two-stage method closely approaches the performance of advanced size- and state-aware dispatching.
Results are consistent across synthetic Weibull workloads and real Google traces.
Architectural design choices can replace the need for increased dispatcher complexity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method highlights the potential of using limited information effectively through system structure.
It may extend to other distributed computing scenarios with heterogeneous job sizes.
Optimizing the threshold dynamically could be explored as an extension.

Load-bearing premise

An optimized service-time threshold exists that reliably separates large jobs from shorter ones to improve performance without requiring full knowledge of job sizes or server states at dispatch time.

What would settle it

Observing that the mean response time does not decrease with the two-stage architecture compared to single-stage for any tested threshold on the Google data center traces would falsify the claim.

Figures

Figures reproduced from arXiv: 2505.03032 by Alexey Rolich, Andrea Baiocchi, Mert Yildiz.

**Figure 2.** Figure 2: MRT as a function of the number of servers. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: MRT as a function of the system load. Note: COV = 10 and n = 10. 11 1 " # 1 11 1 11 1 1 [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: MRT as a function of the number of servers. [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: MRT as a function of the system load. Google data has [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: MRT as a function of the number of servers. Google da [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗

read the original abstract

A continuing effort is devoted to devising effective dispatching policies for clusters of First Come First Served servers. Although the optimal solution for dispatchers aware of both job size and server state remains elusive, lower bounds and strong heuristics are known. In this paper, we introduce a two-stage cluster architecture that applies classical Round Robin, Join Idle Queue, and Least Work Left dispatching schemes, coupled with an optimized service-time threshold to separate large jobs from shorter ones. Using both synthetic (Weibull) workloads and real Google data center traces, we demonstrate that our two-stage approach greatly improves upon the corresponding single-stage policies and closely approaches the performance of advanced size- and state-aware methods. Our results highlight that careful architectural design-rather than increased complexity at the dispatcher-can yield significantly better mean response times in large-scale computing environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The two-stage split with a service-time threshold lifts basic dispatchers like RR and JIQ toward size-aware performance on the tested traces, but the threshold choice looks tuned to those same traces.

read the letter

The main takeaway is that this paper demonstrates a practical architectural tweak: route short jobs and long jobs through separate instances of classical policies (RR, JIQ, LWL) using one fixed service-time threshold. On Weibull synthetics and Google traces the two-stage version beats the single-stage baselines and gets close to more complex size- and state-aware methods without needing full job-size or server-state info at every arrival.

Referee Report

2 major / 2 minor

Summary. The paper proposes a two-stage cluster architecture for job dispatching that augments classical policies (Round Robin, Join Idle Queue, Least Work Left) with an optimized service-time threshold to separate large jobs from shorter ones. Simulations on synthetic Weibull workloads and Google data center traces claim that the two-stage approach substantially improves mean response times over the corresponding single-stage policies and approaches the performance of advanced size- and state-aware methods.

Significance. If the service-time threshold can be selected in a workload-robust manner without requiring hindsight or full job-size information at dispatch time, the architectural split offers a practical compromise between dispatcher simplicity and performance. The evaluation on both synthetic and real traces is a positive feature; however, the lack of detail on threshold selection and statistical rigor limits the strength of the claimed gains.

major comments (2)

[Evaluation / Simulation Setup] The manuscript does not specify the exact procedure used to select the 'optimized service-time threshold' (mentioned in the abstract and evaluation). If this threshold is tuned via exhaustive search or distribution fitting on the same Weibull and Google traces used for final reporting, the reported improvements over single-stage policies and closeness to size-aware methods may be partly an artifact of workload-specific optimization rather than an intrinsic property of the two-stage split. This directly affects the central claim that the architecture works without full size or state knowledge at dispatch time.
[Results / Figures] Simulation results lack error bars, confidence intervals, or statistical significance tests for the reported mean response time improvements. Without these, it is difficult to assess whether the gains over single-stage RR/JIQ/LWL are reliable or sensitive to random seeds and trace variability.

minor comments (2)

[Section 3] Clarify in the architecture description whether the threshold is fixed after initial offline tuning or can be adapted online; the current wording leaves this ambiguous.
[Section 4] Add a brief discussion of how the two-stage split interacts with the specific dispatching policies (e.g., does the threshold affect only the first stage or both?).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the paper accordingly to improve clarity and rigor.

read point-by-point responses

Referee: [Evaluation / Simulation Setup] The manuscript does not specify the exact procedure used to select the 'optimized service-time threshold' (mentioned in the abstract and evaluation). If this threshold is tuned via exhaustive search or distribution fitting on the same Weibull and Google traces used for final reporting, the reported improvements over single-stage policies and closeness to size-aware methods may be partly an artifact of workload-specific optimization rather than an intrinsic property of the two-stage split. This directly affects the central claim that the architecture works without full size or state knowledge at dispatch time.

Authors: We appreciate the referee raising this key issue. The threshold is selected offline once per workload class by optimizing mean response time over a discrete range of candidate values, using the known distribution parameters for synthetic Weibull workloads or aggregate statistics derived from the trace for the Google data. After this one-time selection, the dispatcher uses only the fixed threshold to route jobs to the appropriate stage and then applies the classical policy (RR, JIQ, or LWL) without any per-job size information or server-state knowledge beyond what the base policy requires. We agree the original manuscript omitted these procedural details and will add a dedicated subsection (including pseudocode and a sensitivity plot) in the revision to describe the selection process and demonstrate that nearby thresholds yield similar gains. This preserves the central claim that the architecture needs only coarse, static size awareness rather than full job-size or state information at runtime. revision: yes
Referee: [Results / Figures] Simulation results lack error bars, confidence intervals, or statistical significance tests for the reported mean response time improvements. Without these, it is difficult to assess whether the gains over single-stage RR/JIQ/LWL are reliable or sensitive to random seeds and trace variability.

Authors: We concur that the absence of variability measures weakens the presentation. In the revised manuscript we will re-generate all figures using 10 independent replications per configuration (different random seeds for synthetic workloads and different trace shuffles for the Google data) and add error bars showing the standard error of the mean. A short paragraph will also be added to the evaluation section stating the replication count and confirming that the observed improvements remain consistent across runs. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results on independent external workloads

full rationale

The paper proposes a two-stage dispatching architecture combining classical policies (RR/JIQ/LWL) with a service-time threshold and validates performance via direct simulation on synthetic Weibull workloads and real Google cluster traces. No derivation chain, first-principles prediction, or equation is claimed that reduces by construction to a fitted quantity or self-citation. The threshold is presented as an optimized design parameter whose value is selected to demonstrate gains; results are measured outcomes on external data rather than tautological re-expression of inputs. This is the standard non-circular case for an empirical systems paper.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 0 invented entities

The central claim depends on the existence and choice of an optimized service-time threshold used to classify jobs; this threshold is a free parameter whose value is selected to achieve the reported performance gains.

free parameters (1)

service-time threshold
Threshold value chosen or optimized to separate large jobs from short ones; its selection directly affects which dispatching policy applies to each job and therefore the measured response times.

pith-pipeline@v0.9.0 · 5674 in / 1173 out tokens · 55372 ms · 2026-05-22T16:11:06.919693+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

two-stage cluster architecture that applies classical Round Robin, Join Idle Queue, and Least Work Left dispatching schemes, coupled with an optimized service-time threshold to separate large jobs from shorter ones
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Using both synthetic (Weibull) workloads and real Google data center traces

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 2 internal anchors

[1]

Harchol-Balter, Performance Modeling and Design of Computer Systems: Queueing Theory in Action

M. Harchol-Balter, Performance Modeling and Design of Computer Systems: Queueing Theory in Action . Cambridge University Press, 2013

work page 2013
[2]

Designing Lo w- Complexity Heavy-Trafﬁc Delay-Optimal Load Balancing Sch emes: Theory to Algorithms,

X. Zhou, F. Wu, J. Tan, Y . Sun, and N. Shroff, “Designing Lo w- Complexity Heavy-Trafﬁc Delay-Optimal Load Balancing Sch emes: Theory to Algorithms,” Proc. ACM Meas. Anal. Comput. Syst. , vol. 1, no. 2, Dec. 2017

work page 2017
[3]

A comprehensi ve survey for scheduling techniques in cloud computing,

M. Kumar, S. Sharma, A. Goel, and S. Singh, “A comprehensi ve survey for scheduling techniques in cloud computing,” Journal of Network and Computer Applications , vol. 143, pp. 1–33, 2019

work page 2019
[4]

A Survey on Scheduling Techniques in Computing and Network Convergence,

S. Tang, Y . Y u, H. Wang, G. Wang, W. Chen, Z. Xu, S. Guo, and W. Gao, “A Survey on Scheduling Techniques in Computing and Network Convergence,” IEEE Communications Surveys & Tutorials , vol. 26, no. 1, pp. 160–195, 2024

work page 2024
[5]

Open problems in queueing theory in spired by datacenter computing,

M. Harchol-Balter, “Open problems in queueing theory in spired by datacenter computing,” Queueing Syst. Theory Appl. , vol. 97, pp. 3– 37, feb 2021

work page 2021
[6]

On Sequential Dispatching Pol icies,

E. Hyytiä and R. Righter, “On Sequential Dispatching Pol icies,” in 2022 32nd International Telecommunication Networks and Ap plica- tions Conference (ITNAC) , 2022, pp. 1–6

work page 2022
[7]

Towards the optimal dynamic size-aware dispatchin g,

——, “Towards the optimal dynamic size-aware dispatchin g,” Perfor- mance Evaluation , vol. 164, p. 102396, 2024

work page 2024
[8]

Heavy-Trafﬁc Optimal S ize- and State-Aware Dispatching,

R. Xie, I. Grosof, and Z. Scully, “Heavy-Trafﬁc Optimal S ize- and State-Aware Dispatching,” Proc. ACM Meas. Anal. Comput. Syst. , vol. 8, no. 1, Feb. 2024

work page 2024
[9]

cluster-data/Clusterdata2019,

J. Wilkes, “cluster-data/Clusterdata2019,” https://github.com/google/cluster-data/tree/master, 2019

work page 2019
[10]

Borg: the next generatio n,

M. Tirmazi, A. Barker, N. Deng, M. E. Haque, Z. G. Qin, S. H and, M. Harchol-Balter, and J. Wilkes, “Borg: the next generatio n,” in Pro- ceedings of the Fifteenth European Conference on Computer S ystems. Association for Computing Machinery, 2020

work page 2020
[11]

Trace-Based Workload Generation and Execution,

Y . Sfakianakis, E. Kanellou, M. Marazakis, and A. Bilas , “Trace-Based Workload Generation and Execution,” in Euro-Par 2021: Parallel Processing: 27th International Conference on Parallel and Distributed Computing, Lisbon, Portugal, September 1–3, 2021, Proceed ings. Springer-V erlag, 2021, p. 37–54

work page 2021
[12]

Data-Driven Workload Gener ation Based on Google Data Center Measurements,

M. Yildiz and A. Baiocchi, “Data-Driven Workload Gener ation Based on Google Data Center Measurements,” in 2024 IEEE 25th Inter- national Conference on High Performance Switching and Rout ing (HPSR), 2024, pp. 143–148

work page 2024
[13]

The Merit of Simple Policies: Buying Performance With Parallelism and System Architecture

M. Yildiz, A. Rolich, and A. Baiocchi, “The Merit of Simp le Policies: Buying Performance With Parallelism and System Architectu re,” 2025. [Online]. Available: https://arxiv.org/abs/2503.16166

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Dispatching Odyssey: Exploring Performance in Computing Clusters under Real-world Workloads

——, “Dispatching Odyssey: Exploring Performance in Co mputing Clusters under Real-world Workloads,” 2025. [Online]. Ava ilable: https://arxiv.org/abs/2504.10184

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

Task ass ignment in a distributed system: Improving performance by unbalancin g load,

M. Crovella, M. Harchol-Balter, and C. Murta, “Task ass ignment in a distributed system: Improving performance by unbalancin g load,” in Proceedings of the ACM SIGMETRICS Joint International Conf erence on Measurement and Modeling of Computer Systems , 1998, pp. 268– 269

work page 1998
[16]

Network analysis without exponen tiality assump- tions,

M. Harchol-Balter, “Network analysis without exponen tiality assump- tions,” Ph.D. dissertation, University of California at Be rkeley, 1996

work page 1996
[17]

Exploiting process l ifetime distri- butions for dynamic load balancing,

M. Harchol-Balter and A. Downey, “Exploiting process l ifetime distri- butions for dynamic load balancing,” in Proceedings of ACM SIGMET- RICS, 1996, pp. 13–24

work page 1996
[18]

The effect of heavy-tailed job siz e distributions on computer system design,

M. Harchol-Balter, “The effect of heavy-tailed job siz e distributions on computer system design,” in Proceedings of ASA-IMS Conference on Applications of Heavy Tailed Distributions in Economics, E ngineering and Statistics , 1999

work page 1999
[19]

Size- based scheduling to improve web performance,

M. Harchol-Balter, B. Schroeder, N. Bansal, and M. Agra wal, “Size- based scheduling to improve web performance,” ACM Transactions on Computer Systems (TOCS) , vol. 21, no. 2, pp. 207–233, 2003

work page 2003
[20]

A Tale of the Tails: Power-Laws in Internet Measurements,

A. Mahanti, N. Carlsson, A. Mahanti, M. Arlitt, and C. Wi lliamson, “A Tale of the Tails: Power-Laws in Internet Measurements,” IEEE Network, vol. 27, no. 1, pp. 59–64, 2013

work page 2013

[1] [1]

Harchol-Balter, Performance Modeling and Design of Computer Systems: Queueing Theory in Action

M. Harchol-Balter, Performance Modeling and Design of Computer Systems: Queueing Theory in Action . Cambridge University Press, 2013

work page 2013

[2] [2]

Designing Lo w- Complexity Heavy-Trafﬁc Delay-Optimal Load Balancing Sch emes: Theory to Algorithms,

X. Zhou, F. Wu, J. Tan, Y . Sun, and N. Shroff, “Designing Lo w- Complexity Heavy-Trafﬁc Delay-Optimal Load Balancing Sch emes: Theory to Algorithms,” Proc. ACM Meas. Anal. Comput. Syst. , vol. 1, no. 2, Dec. 2017

work page 2017

[3] [3]

A comprehensi ve survey for scheduling techniques in cloud computing,

M. Kumar, S. Sharma, A. Goel, and S. Singh, “A comprehensi ve survey for scheduling techniques in cloud computing,” Journal of Network and Computer Applications , vol. 143, pp. 1–33, 2019

work page 2019

[4] [4]

A Survey on Scheduling Techniques in Computing and Network Convergence,

S. Tang, Y . Y u, H. Wang, G. Wang, W. Chen, Z. Xu, S. Guo, and W. Gao, “A Survey on Scheduling Techniques in Computing and Network Convergence,” IEEE Communications Surveys & Tutorials , vol. 26, no. 1, pp. 160–195, 2024

work page 2024

[5] [5]

Open problems in queueing theory in spired by datacenter computing,

M. Harchol-Balter, “Open problems in queueing theory in spired by datacenter computing,” Queueing Syst. Theory Appl. , vol. 97, pp. 3– 37, feb 2021

work page 2021

[6] [6]

On Sequential Dispatching Pol icies,

E. Hyytiä and R. Righter, “On Sequential Dispatching Pol icies,” in 2022 32nd International Telecommunication Networks and Ap plica- tions Conference (ITNAC) , 2022, pp. 1–6

work page 2022

[7] [7]

Towards the optimal dynamic size-aware dispatchin g,

——, “Towards the optimal dynamic size-aware dispatchin g,” Perfor- mance Evaluation , vol. 164, p. 102396, 2024

work page 2024

[8] [8]

Heavy-Trafﬁc Optimal S ize- and State-Aware Dispatching,

R. Xie, I. Grosof, and Z. Scully, “Heavy-Trafﬁc Optimal S ize- and State-Aware Dispatching,” Proc. ACM Meas. Anal. Comput. Syst. , vol. 8, no. 1, Feb. 2024

work page 2024

[9] [9]

cluster-data/Clusterdata2019,

J. Wilkes, “cluster-data/Clusterdata2019,” https://github.com/google/cluster-data/tree/master, 2019

work page 2019

[10] [10]

Borg: the next generatio n,

M. Tirmazi, A. Barker, N. Deng, M. E. Haque, Z. G. Qin, S. H and, M. Harchol-Balter, and J. Wilkes, “Borg: the next generatio n,” in Pro- ceedings of the Fifteenth European Conference on Computer S ystems. Association for Computing Machinery, 2020

work page 2020

[11] [11]

Trace-Based Workload Generation and Execution,

Y . Sfakianakis, E. Kanellou, M. Marazakis, and A. Bilas , “Trace-Based Workload Generation and Execution,” in Euro-Par 2021: Parallel Processing: 27th International Conference on Parallel and Distributed Computing, Lisbon, Portugal, September 1–3, 2021, Proceed ings. Springer-V erlag, 2021, p. 37–54

work page 2021

[12] [12]

Data-Driven Workload Gener ation Based on Google Data Center Measurements,

M. Yildiz and A. Baiocchi, “Data-Driven Workload Gener ation Based on Google Data Center Measurements,” in 2024 IEEE 25th Inter- national Conference on High Performance Switching and Rout ing (HPSR), 2024, pp. 143–148

work page 2024

[13] [13]

The Merit of Simple Policies: Buying Performance With Parallelism and System Architecture

M. Yildiz, A. Rolich, and A. Baiocchi, “The Merit of Simp le Policies: Buying Performance With Parallelism and System Architectu re,” 2025. [Online]. Available: https://arxiv.org/abs/2503.16166

work page internal anchor Pith review Pith/arXiv arXiv 2025

[14] [14]

Dispatching Odyssey: Exploring Performance in Computing Clusters under Real-world Workloads

——, “Dispatching Odyssey: Exploring Performance in Co mputing Clusters under Real-world Workloads,” 2025. [Online]. Ava ilable: https://arxiv.org/abs/2504.10184

work page internal anchor Pith review Pith/arXiv arXiv 2025

[15] [15]

Task ass ignment in a distributed system: Improving performance by unbalancin g load,

M. Crovella, M. Harchol-Balter, and C. Murta, “Task ass ignment in a distributed system: Improving performance by unbalancin g load,” in Proceedings of the ACM SIGMETRICS Joint International Conf erence on Measurement and Modeling of Computer Systems , 1998, pp. 268– 269

work page 1998

[16] [16]

Network analysis without exponen tiality assump- tions,

M. Harchol-Balter, “Network analysis without exponen tiality assump- tions,” Ph.D. dissertation, University of California at Be rkeley, 1996

work page 1996

[17] [17]

Exploiting process l ifetime distri- butions for dynamic load balancing,

M. Harchol-Balter and A. Downey, “Exploiting process l ifetime distri- butions for dynamic load balancing,” in Proceedings of ACM SIGMET- RICS, 1996, pp. 13–24

work page 1996

[18] [18]

The effect of heavy-tailed job siz e distributions on computer system design,

M. Harchol-Balter, “The effect of heavy-tailed job siz e distributions on computer system design,” in Proceedings of ASA-IMS Conference on Applications of Heavy Tailed Distributions in Economics, E ngineering and Statistics , 1999

work page 1999

[19] [19]

Size- based scheduling to improve web performance,

M. Harchol-Balter, B. Schroeder, N. Bansal, and M. Agra wal, “Size- based scheduling to improve web performance,” ACM Transactions on Computer Systems (TOCS) , vol. 21, no. 2, pp. 207–233, 2003

work page 2003

[20] [20]

A Tale of the Tails: Power-Laws in Internet Measurements,

A. Mahanti, N. Carlsson, A. Mahanti, M. Arlitt, and C. Wi lliamson, “A Tale of the Tails: Power-Laws in Internet Measurements,” IEEE Network, vol. 27, no. 1, pp. 59–64, 2013

work page 2013