Duration-Informed Workload Scheduler

Andrea Borghesi; Daniela Loreti; Davide Leone

arxiv: 2604.09599 · v1 · submitted 2026-03-07 · 💻 cs.DC · cs.AI

Duration-Informed Workload Scheduler

Daniela Loreti , Davide Leone , Andrea Borghesi This is my paper

Pith reviewed 2026-05-15 14:45 UTC · model grok-4.3

classification 💻 cs.DC cs.AI

keywords workload schedulerjob duration predictionmachine learninghigh-performance computingwaiting timesupercomputer tracesscheduling optimization

0 comments

The pith

A machine learning module that predicts job runtimes lets a workload scheduler cut average waiting time by 11 percent on real supercomputer traces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to improve a standard workload scheduler by adding a machine learning model that forecasts how long each submitted job will take. With those forecasts available before any job starts, the scheduler can order jobs more effectively and avoid blocking short jobs behind long ones. On traces from a Tier-0 supercomputer the enhanced scheduler lowers the mean waiting time across all jobs by roughly 11 percent. Shorter waits translate directly into faster turnaround for the system and better service for users who submit jobs. The approach relies on using only information available at job submission time to drive the predictions.

Core claim

We devise a workload scheduler enhanced with a duration prediction module built via Machine Learning. We evaluate its effectiveness and show its performance using workload traces from a Tier-0 supercomputer, demonstrating a decrease in mean waiting time across all jobs of around 11%. Lower waiting times are directly connected to better quality of service from the users' point of view and higher turnaround from the system's perspective.

What carries the argument

Machine-learning duration prediction module that converts submitted job metadata into runtime forecasts and feeds those forecasts into the scheduler's decision process.

If this is right

Users experience shorter average waits before their jobs start.
The computing facility completes the same workload in less wall-clock time.
The method works on production traces without requiring changes to how users submit jobs.
The gain appears across the entire job mix rather than only for a subset of jobs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prediction-plus-scheduling pattern could be tested on cloud or cluster workloads that also lack reliable runtime estimates.
If prediction error varies with job size, the scheduler may need a simple fallback rule for low-confidence forecasts.
Combining duration forecasts with other submission-time features such as requested cores or memory could produce further reductions in waiting time.

Load-bearing premise

Machine learning can produce sufficiently accurate job-duration forecasts from submission-time data alone and these forecasts can be used by the scheduler without adding new overhead that cancels the gains.

What would settle it

Running the same scheduler on the same traces but with randomized or zero-accuracy duration predictions; if mean waiting time does not rise back to the baseline level, the claimed benefit from duration information is not supported.

Figures

Figures reproduced from arXiv: 2604.09599 by Andrea Borghesi, Daniela Loreti, Davide Leone.

**Figure 1.** Figure 1: Histogram of the target variable run_time an accurate elaboration of a two-years-long data collection [4] from a production supercomputer: MARCONI100 hosted by the HPC center CINECA1 . The considered dataset consists of 628.977 elements and a set of submission time features for each job, which are described in [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗

**Figure 2.** Figure 2: Scheduling performance comparison scheduler (1457 vs 1241). This is a direct consequence of the very constrained testing environment and–as already pointed out for Setup A–the better capacity of DIWS to estimate the durations, highlighting the huge difference existing between jobs. 5 Conclusion As ML techniques have shown promising results in several scientific fields, we propose to apply analogous methods… view at source ↗

read the original abstract

High-performance computing systems are complex machines whose behaviour is governed by the correct functioning of its many subsystems. Among these, the workload scheduler has a crucial impact on the timely execution of the jobs continuously submitted to the computing resources. Making high-quality scheduling decisions is contingent on knowing the duration of submitted jobs before their execution--a non-trivial task for users that can be tackled with Machine Learning. In this work, we devise a workload scheduler enhanced with a duration prediction module built via Machine Learning. We evaluate its effectiveness and show its performance using workload traces from a Tier-0 supercomputer, demonstrating a decrease in mean waiting time across all jobs of around 11%. Lower waiting times are directly connected to better quality of service from the users' point of view and higher turnaround from the system's perspective.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper adds an ML duration predictor to an HPC scheduler and claims an 11% drop in mean waiting time on Tier-0 traces, but leaves the model, features, and integration steps unspecified.

read the letter

The main point is that the authors train a machine learning model on submitted job metadata to predict run times, then use those predictions inside the scheduler instead of user estimates or basic policies. On real traces from a large supercomputer they report about 11% lower average waiting time across jobs. This is a direct, practical extension of predictive scheduling ideas that already exist in the HPC literature. The evaluation on actual Tier-0 logs is the part that stands out; it gives the result more weight than synthetic benchmarks would. The connection to user-visible waiting time and system turnaround is also the right metric to track. The soft spots are exactly where the stress-test note points. The text does not describe the feature set fed to the model, the algorithm used, the training and validation procedure, the prediction error distribution, or the precise rule change inside the scheduler. Without those pieces it is hard to tell whether the 11% gain survives realistic noise or is tied to the particular trace and simulator. The assumption that accurate predictions can be made from pre-execution metadata is stated but not stress-tested in the provided description. This is the sort of work that belongs in a reading group focused on systems scheduling. Readers who manage or simulate large clusters would find the trace-driven result useful even if they have to ask for the missing details. The paper deserves peer review so referees can request the model architecture, baselines, and statistical checks that are currently absent.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes a workload scheduler for high-performance computing systems that augments standard scheduling with a machine-learning module to predict job durations from submitted metadata. Evaluated on traces from a Tier-0 supercomputer, the approach is reported to reduce mean waiting time across all jobs by approximately 11%.

Significance. If the prediction accuracy and integration details hold under scrutiny, the work could offer a practical way to improve quality of service and system turnaround in production HPC schedulers by replacing or augmenting heuristics such as FCFS or SJF with duration-informed decisions. The use of real Tier-0 traces is a positive empirical anchor.

major comments (2)

[Abstract] Abstract: the central performance claim of an 11% reduction in mean waiting time is presented without any description of the ML model architecture, feature set, training/validation procedure, prediction-error distribution, baseline schedulers, or statistical significance testing; these omissions make it impossible to determine whether the reported gain is robust to realistic prediction noise or an artifact of the specific trace and simulator.
[Evaluation] Evaluation section (implied by the abstract claim): the manuscript does not specify how predicted durations are incorporated into the scheduling policy (e.g., whether they replace actual durations in SJF, are used as weights in a priority queue, or trigger backfilling), nor does it report overhead measurements for the prediction module itself; without these details the 11% figure cannot be reproduced or stress-tested.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the current manuscript lacks sufficient detail on the ML model and scheduling integration to support the reported performance claim. We will revise the manuscript to address these omissions.

read point-by-point responses

Referee: [Abstract] Abstract: the central performance claim of an 11% reduction in mean waiting time is presented without any description of the ML model architecture, feature set, training/validation procedure, prediction-error distribution, baseline schedulers, or statistical significance testing; these omissions make it impossible to determine whether the reported gain is robust to realistic prediction noise or an artifact of the specific trace and simulator.

Authors: We agree that the abstract is overly concise and omits critical details needed to assess robustness. In the revised manuscript we will expand the abstract to briefly describe the ML model as a regression-based predictor, the feature set drawn from submitted job metadata, the training/validation split used, the observed prediction error distribution, the FCFS and SJF baselines, and the statistical significance of the 11% waiting-time reduction. Full technical specifications will remain in the body but will be cross-referenced from the abstract. revision: yes
Referee: [Evaluation] Evaluation section (implied by the abstract claim): the manuscript does not specify how predicted durations are incorporated into the scheduling policy (e.g., whether they replace actual durations in SJF, are used as weights in a priority queue, or trigger backfilling), nor does it report overhead measurements for the prediction module itself; without these details the 11% figure cannot be reproduced or stress-tested.

Authors: We agree that the integration mechanism and overhead are not described. The revised manuscript will explicitly state that predicted durations replace actual durations inside a shortest-job-first policy augmented with conservative backfilling. We will add pseudocode for the combined scheduler and report measured prediction overhead (sub-millisecond per job on the evaluation hardware). These additions will enable reproduction and sensitivity analysis under realistic prediction noise. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical ML prediction evaluated on traces

full rationale

The derivation consists of training an ML model on job metadata to predict durations, integrating the predictions into a scheduler policy, and measuring mean waiting time reduction via simulation on held-out workload traces. None of the load-bearing steps reduce by construction to the inputs, rely on self-citation for uniqueness, or rename a fitted quantity as a prediction. The reported 11% improvement is an external empirical outcome on Tier-0 traces rather than a self-referential or definitional result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that ML models trained on historical workload data can produce sufficiently accurate duration predictions to improve scheduling decisions. No free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

domain assumption Machine learning models can predict job durations from historical workload data with useful accuracy
This is the core premise enabling the duration-informed scheduler.

pith-pipeline@v0.9.0 · 5423 in / 1114 out tokens · 35135 ms · 2026-05-15T14:45:25.902730+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages

[1]

Journal of Network and Computer Applications 82, 93–113 (2017) 12 Daniela Loreti et al

Amiri, M., Mohammad-Khanli, L.: Survey on prediction models of applications for resources provisioning in cloud. Journal of Network and Computer Applications 82, 93–113 (2017) 12 Daniela Loreti et al

work page 2017
[2]

Antici, F., Ardebili, M.S., et al.: PM100: A job power consumption dataset of a large-scale production HPC system. In: Proceedings of the SC ’23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, SC-W 2023, Denver, CO, USA, November 12-17, 2023. pp. 1812–

work page 2023
[3]

Bengio, Y., Goodfellow, I., Courville, A.: Deep learning, vol. 1. MIT press Mas- sachusetts, USA: (2017)

work page 2017
[4]

Scientific Data 10(1), 288 (2023)

Borghesi, A., Di Santi, C., Molan, M., Ardebili, M.S., Mauri, A., Guarrasi, M., Galetti, D., Cestari, M., Barchi, F., Benini, L., et al.: M100 exadata: a data col- lection campaign on the cineca’s marconi100 tier-0 supercomputer. Scientific Data 10(1), 288 (2023)

work page 2023
[5]

Machine learning45(1), 5–32 (2001)

Breiman, L.: Random forests. Machine learning45(1), 5–32 (2001)

work page 2001
[6]

CRC press (1984)

Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A.: Classification and regression trees. CRC press (1984)

work page 1984
[7]

In: Proceedings of the Fourth Annual IEEE International Workshop on Workload Characterization

Cirne, W., Berman, F.: A comprehensive model of the supercomputer work- load. In: Proceedings of the Fourth Annual IEEE International Workshop on Workload Characterization. WWC-4 (Cat. No.01EX538). pp. 140–148 (2001). https://doi.org/10.1109/WWC.2001.990753

work page doi:10.1109/wwc.2001.990753 2001
[8]

1–25 (2024)

De Filippo, A., Di Giacomo, E., Borghesi, A.: Machine learning approaches to predicttheexecutiontimeofthemeteorologicalsimulationsoftwarecosmo.Journal of Intelligent Information Systems pp. 1–25 (2024)

work page 2024
[9]

In: 20th Workshop on Job Scheduling Strategies for Parallel Processing

Dutot, P.F., Mercier, M., et al.: Batsim: a Realistic Language-Independent Re- sources and Jobs Management Systems Simulator. In: 20th Workshop on Job Scheduling Strategies for Parallel Processing. Chicago, United States (May 2016)

work page 2016
[10]

An- nals of statistics pp

Friedman, J.H.: Greedy function approximation: a gradient boosting machine. An- nals of statistics pp. 1189–1232 (2001)

work page 2001
[11]

Artificial Intelligence206, 79–111 (2014)

Hutter, F., Xu, L., Hoos, H.H., Leyton-Brown, K.: Algorithm runtime prediction: Methods & evaluation. Artificial Intelligence206, 79–111 (2014)

work page 2014
[12]

Li, J., Zhang, X., Han, L., Ji, Z., Dong, X., Hu, C.: OKCM: improving parallel task scheduling in high-performance computing systems using online learning. J. Su- percomput.77(6), 5960–5983 (2021). https://doi.org/10.1007/S11227-020-03506- 5, https://doi.org/10.1007/s11227-020-03506-5

work page doi:10.1007/s11227-020-03506- 2021
[13]

In: 2012 SC Companion: High Performance Computing, Networking Storage and Analysis

Miu, T., Missier, P.: Predicting the execution time of workflow activities based on their input features. In: 2012 SC Companion: High Performance Computing, Networking Storage and Analysis. pp. 64–72. IEEE (2012)

work page 2012
[14]

Cluster Computing20(3), 2805–2819 (2017)

Nadeem, F., Alghazzawi, D., et al.: Modeling and predicting execution time of scientific workflows in the grid using radial basis function neural network. Cluster Computing20(3), 2805–2819 (2017)

work page 2017
[15]

In: Proceedings of the Platform for Advanced Scientific Computing Conference

Pittino, F., Bonfà, P., Bartolini, A., Affinito, F., Benini, L., Cavazzoni, C.: Pre- diction of time-to-solution in material science simulations using deep learning. In: Proceedings of the Platform for Advanced Scientific Computing Conference. pp. 1– 9 (2019)

work page 2019
[16]

In: Furlani, T.R

Tanash, M., Dunn, B., ry al.: Improving HPC system performance by predicting job resources via supervised machine learning. In: Furlani, T.R. (ed.) Proceedings of the Practice and Experience in Advanced Research Computing on Rise of the Machines (learning), PEARC 2019, Chicago, IL, USA, July 28 - August 01, 2019. pp. 69:1–69:8. ACM (2019)

work page 2019
[17]

In: 2007 IEEE International Conference on Cluster Com- puting

Wong, A.K., Goscinski, A.M.: Evaluating the easy-backfill job scheduling of static workloads on clusters. In: 2007 IEEE International Conference on Cluster Com- puting. pp. 64–73. IEEE (2007)

work page 2007

[1] [1]

Journal of Network and Computer Applications 82, 93–113 (2017) 12 Daniela Loreti et al

Amiri, M., Mohammad-Khanli, L.: Survey on prediction models of applications for resources provisioning in cloud. Journal of Network and Computer Applications 82, 93–113 (2017) 12 Daniela Loreti et al

work page 2017

[2] [2]

Antici, F., Ardebili, M.S., et al.: PM100: A job power consumption dataset of a large-scale production HPC system. In: Proceedings of the SC ’23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, SC-W 2023, Denver, CO, USA, November 12-17, 2023. pp. 1812–

work page 2023

[3] [3]

Bengio, Y., Goodfellow, I., Courville, A.: Deep learning, vol. 1. MIT press Mas- sachusetts, USA: (2017)

work page 2017

[4] [4]

Scientific Data 10(1), 288 (2023)

Borghesi, A., Di Santi, C., Molan, M., Ardebili, M.S., Mauri, A., Guarrasi, M., Galetti, D., Cestari, M., Barchi, F., Benini, L., et al.: M100 exadata: a data col- lection campaign on the cineca’s marconi100 tier-0 supercomputer. Scientific Data 10(1), 288 (2023)

work page 2023

[5] [5]

Machine learning45(1), 5–32 (2001)

Breiman, L.: Random forests. Machine learning45(1), 5–32 (2001)

work page 2001

[6] [6]

CRC press (1984)

Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A.: Classification and regression trees. CRC press (1984)

work page 1984

[7] [7]

In: Proceedings of the Fourth Annual IEEE International Workshop on Workload Characterization

Cirne, W., Berman, F.: A comprehensive model of the supercomputer work- load. In: Proceedings of the Fourth Annual IEEE International Workshop on Workload Characterization. WWC-4 (Cat. No.01EX538). pp. 140–148 (2001). https://doi.org/10.1109/WWC.2001.990753

work page doi:10.1109/wwc.2001.990753 2001

[8] [8]

1–25 (2024)

De Filippo, A., Di Giacomo, E., Borghesi, A.: Machine learning approaches to predicttheexecutiontimeofthemeteorologicalsimulationsoftwarecosmo.Journal of Intelligent Information Systems pp. 1–25 (2024)

work page 2024

[9] [9]

In: 20th Workshop on Job Scheduling Strategies for Parallel Processing

Dutot, P.F., Mercier, M., et al.: Batsim: a Realistic Language-Independent Re- sources and Jobs Management Systems Simulator. In: 20th Workshop on Job Scheduling Strategies for Parallel Processing. Chicago, United States (May 2016)

work page 2016

[10] [10]

An- nals of statistics pp

Friedman, J.H.: Greedy function approximation: a gradient boosting machine. An- nals of statistics pp. 1189–1232 (2001)

work page 2001

[11] [11]

Artificial Intelligence206, 79–111 (2014)

Hutter, F., Xu, L., Hoos, H.H., Leyton-Brown, K.: Algorithm runtime prediction: Methods & evaluation. Artificial Intelligence206, 79–111 (2014)

work page 2014

[12] [12]

Li, J., Zhang, X., Han, L., Ji, Z., Dong, X., Hu, C.: OKCM: improving parallel task scheduling in high-performance computing systems using online learning. J. Su- percomput.77(6), 5960–5983 (2021). https://doi.org/10.1007/S11227-020-03506- 5, https://doi.org/10.1007/s11227-020-03506-5

work page doi:10.1007/s11227-020-03506- 2021

[13] [13]

In: 2012 SC Companion: High Performance Computing, Networking Storage and Analysis

Miu, T., Missier, P.: Predicting the execution time of workflow activities based on their input features. In: 2012 SC Companion: High Performance Computing, Networking Storage and Analysis. pp. 64–72. IEEE (2012)

work page 2012

[14] [14]

Cluster Computing20(3), 2805–2819 (2017)

Nadeem, F., Alghazzawi, D., et al.: Modeling and predicting execution time of scientific workflows in the grid using radial basis function neural network. Cluster Computing20(3), 2805–2819 (2017)

work page 2017

[15] [15]

In: Proceedings of the Platform for Advanced Scientific Computing Conference

Pittino, F., Bonfà, P., Bartolini, A., Affinito, F., Benini, L., Cavazzoni, C.: Pre- diction of time-to-solution in material science simulations using deep learning. In: Proceedings of the Platform for Advanced Scientific Computing Conference. pp. 1– 9 (2019)

work page 2019

[16] [16]

In: Furlani, T.R

Tanash, M., Dunn, B., ry al.: Improving HPC system performance by predicting job resources via supervised machine learning. In: Furlani, T.R. (ed.) Proceedings of the Practice and Experience in Advanced Research Computing on Rise of the Machines (learning), PEARC 2019, Chicago, IL, USA, July 28 - August 01, 2019. pp. 69:1–69:8. ACM (2019)

work page 2019

[17] [17]

In: 2007 IEEE International Conference on Cluster Com- puting

Wong, A.K., Goscinski, A.M.: Evaluating the easy-backfill job scheduling of static workloads on clusters. In: 2007 IEEE International Conference on Cluster Com- puting. pp. 64–73. IEEE (2007)

work page 2007