On the Role of DAG topology in Energy-Aware Cloud Scheduling: A GNN-Based Deep Reinforcement Learning Approach

Anas Hattay; Eric Gascard; Fred Ngole Mboula; Zakaria Yahouni

arxiv: 2604.09202 · v1 · submitted 2026-04-10 · 💻 cs.LG · cs.AI


Pith reviewed 2026-05-10 18:07 UTC · model grok-4.3


keywords GNN · deep reinforcement learning · cloud scheduling · DAG topology · out-of-distribution · energy-aware scheduling · workflow scheduling · message passing


The pith

GNN-based schedulers for cloud DAG workflows degrade when training and deployment topologies differ

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies a graph neural network combined with deep reinforcement learning for assigning tasks in workflow DAGs to cloud resources while minimizing completion time and energy use. It focuses on a single-workflow setting without queues and tests the scheduler under conditions where the input graphs differ from those seen during training. The central result is that these models lose effectiveness because structural differences in the DAGs interfere with how the GNN passes messages between nodes, which in turn harms the learned scheduling policy. A sympathetic reader would care because real cloud workloads present many different workflow shapes, so knowing exactly when the AI scheduler stops working reliably informs how to make such systems more dependable.

Core claim

We identify specific out-of-distribution conditions under which GNN-based deep reinforcement learning schedulers fail and provide a principled explanation of why these failures occur. Through controlled OOD evaluations, we demonstrate that performance degradation stems from structural mismatches between training and deployment environments, which disrupt message passing and undermine policy generalization. Our analysis exposes fundamental limitations of current GNN-based schedulers and highlights the need for more robust representations to ensure reliable scheduling performance under distribution shifts.

What carries the argument

The GNN-based deep reinforcement learning scheduler that encodes the DAG topology via message passing to select task-to-resource assignments balancing completion time and energy consumption
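To make the role of message passing concrete, here is a minimal sketch of a GIN-style sum aggregation over task dependency edges. It is not the authors' implementation: the scalar features, the fixed update rule without a learned MLP, and the two six-task example graphs are all assumptions chosen for illustration. It shows only that the same layer produces very different sink-node values on a wide DAG and on a chain, which is the kind of input shift a policy trained on one shape would face on the other.

```python
# Illustrative only: scalar node features and a fixed, untrained update rule.
# A real GIN layer would pass the aggregated value through a learned MLP.

def gin_layer(features, edges, eps=0.0):
    """One GIN-style step: h_v' = relu((1 + eps) * h_v + sum of predecessor features)."""
    agg = {v: 0.0 for v in features}
    for src, dst in edges:
        agg[dst] += features[src]  # messages flow along dependency edges
    return {v: max(0.0, (1.0 + eps) * features[v] + agg[v]) for v in features}


def embed(features, edges, layers=2):
    """Apply the layer repeatedly and return the final per-node values."""
    h = dict(features)
    for _ in range(layers):
        h = gin_layer(h, edges)
    return h


# Hypothetical wide DAG: one source fans out to four parallel tasks that join in a sink.
wide_edges = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 5), (2, 5), (3, 5), (4, 5)]
# Hypothetical chain DAG: the same six tasks in a single dependency chain.
chain_edges = [(i, i + 1) for i in range(5)]

unit = {v: 1.0 for v in range(6)}
wide_sink = embed(unit, wide_edges)[5]
chain_sink = embed(unit, chain_edges)[5]
print(wide_sink, chain_sink)  # prints 13.0 4.0
```

With sum aggregation, a node's value grows with its in-degree, so the sink of the wide graph lands far from the range seen on chains even though both graphs have the same number of tasks.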

If this is right

Schedulers trained on one collection of DAG structures will exhibit degraded performance on workflows whose topologies differ.
Disruptions to message passing inside the graph neural network directly cause the drop in policy quality.
Current GNN-based approaches carry fundamental limitations when the input graphs change between training and use.
More robust graph representations are required before these schedulers can be relied upon under distribution shifts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Cloud operators could improve reliability by training on a broader range of DAG shapes or by adding explicit topology-variation techniques during learning.
Similar sensitivity to graph structure may appear in other graph-based optimization tasks such as network routing or job-shop scheduling.
Alternative architectures that reduce dependence on exact edge patterns, such as those using learned invariants or set-based processing, could be tested as direct follow-ups.
Long-term, schedulers might incorporate online detection of topology shifts and trigger adaptation steps when a mismatch is observed.

Load-bearing premise

The controlled out-of-distribution conditions examined are representative of the distribution shifts that actually occur in real cloud scheduling deployments.

What would settle it

Deploy the trained GNN scheduler on real cloud workflow traces containing DAGs whose connectivity patterns, depths, or widths differ from the training set while keeping task counts and resource types comparable, then check whether completion time and energy metrics worsen substantially.
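One way to check that deployment DAGs really differ in depth or width from the training set is to compute both quantities per graph. The sketch below is an assumption about how that could be done, not a procedure from the paper: depth is taken as the number of longest-path levels and width as the size of the largest level, which is a common proxy rather than the exact maximum antichain. The function name and the integer task numbering are hypothetical.

```python
from collections import defaultdict, deque


def dag_depth_and_width(num_tasks, edges):
    """Return (depth, width) for a DAG with tasks 0..num_tasks-1 (num_tasks >= 1).

    depth = number of longest-path levels, width = size of the largest level.
    """
    succs = defaultdict(list)
    indeg = [0] * num_tasks
    for src, dst in edges:
        succs[src].append(dst)
        indeg[dst] += 1

    level = [0] * num_tasks
    queue = deque(v for v in range(num_tasks) if indeg[v] == 0)
    visited = 0
    while queue:  # Kahn's algorithm, tracking the longest path to each task
        v = queue.popleft()
        visited += 1
        for w in succs[v]:
            level[w] = max(level[w], level[v] + 1)
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    if visited != num_tasks:
        raise ValueError("graph contains a cycle")

    counts = defaultdict(int)
    for lvl in level:
        counts[lvl] += 1
    return max(level) + 1, max(counts.values())


# Hypothetical six-task examples: a fan-out/fan-in graph and a chain.
wide_edges = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 5), (2, 5), (3, 5), (4, 5)]
chain_edges = [(i, i + 1) for i in range(5)]
print(dag_depth_and_width(6, wide_edges))   # prints (3, 4)
print(dag_depth_and_width(6, chain_edges))  # prints (6, 1)
```

Comparing the distribution of these pairs between the training set and the deployment traces gives a simple, task-count-independent signal of topology shift.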

Figures

Figures reproduced from arXiv: 2604.09202 by Anas Hattay, Eric Gascard, Fred Ngole Mboula, Zakaria Yahouni.

**Figure 1.** Overview of the workflow scheduling problem in a heterogeneous cloud. The scheduler receives a DAG with ready tasks and a set of machines with different speed and power. It must decide which task runs on which machine, and each choice changes both completion time and energy. CPOP identifies a global critical path and assigns its tasks to a single processor to reduce inter processor communicati…

**Figure 2.** Schematic comparison of a wide DAG (left, shallow with many parallel branches) and a Long-CP DAG (right, deep dependency chain with limited side-branch concurrency)

**Figure 3.** Fitness landscape of the scheduling search space. We exhaustively enumerate all feasible action sequences for a small scenario and project them onto a 2D plane using a Hilbert curve. The Z-axis shows energy consumption (lower is better). Left: LongCP DAG. Right: wide DAG. The dataset generator enforces queue-freedom: for each DAG we scale task memory and CPU requirements so that the peak-width layer fits …

**Figure 4.** Speed–power host regimes. Illustration of the four queue-free host regimes studied in this paper. Homogeneous-Speed: all VMs share the same processing speed but differ in power, isolating energy trade-offs at fixed makespan. Homogeneous-Power: all VMs share the same active power but differ in speed, isolating time trade-offs. Heterogeneous Aligned: speed and power are monotonically aligned, inducing an int…

**Figure 5.** Overview of the proposed GIN-based actor–critic scheduler architecture.

**Figure 6.** Empirical attainment functions (EAFs) over 100 test jobs for the wide and LongCP specialists on

**Figure 7.** Correlation between makespan and active energy across Pareto checkpoints for AL (top row) and

**Figure 8.** Panels: a) Active energy, case NA; b) Active energy, case AL; c) Makespan, case NA; d) Makespan,

**Figure 9.** The x-axis shows a parallelism index (normalized ready tasks per level / DAG width) and the y-axis shows an energy intensity index (aggregate active power rate while busy)

Original abstract

Cloud providers must assign heterogeneous compute resources to workflow DAGs while balancing competing objectives such as completion time, cost, and energy consumption. In this work, we study a single-workflow, queue-free scheduling setting and consider a graph neural network (GNN)-based deep reinforcement learning scheduler designed to minimize workflow completion time and energy usage. We identify specific out-of-distribution (OOD) conditions under which GNN-based deep reinforcement learning schedulers fail and provide a principled explanation of why these failures occur. Through controlled OOD evaluations, we demonstrate that performance degradation stems from structural mismatches between training and deployment environments, which disrupt message passing and undermine policy generalization. Our analysis exposes fundamental limitations of current GNN-based schedulers and highlights the need for more robust representations to ensure reliable scheduling performance under distribution shifts.
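Because the scheduler balances completion time against energy, comparing candidate schedules usually means comparing non-dominated trade-offs rather than a single score. The sketch below is an illustrative helper, not something taken from the paper: it filters a list of hypothetical `(makespan, energy)` pairs down to the Pareto front, assuming both objectives are minimized.

```python
def pareto_front(points):
    """Return the non-dominated (makespan, energy) pairs, both minimized.

    A point is dominated if another, different point is no worse on both
    objectives. Duplicates are collapsed and the result is sorted by makespan.
    """
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and q != p
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(set(front))


# Hypothetical schedule outcomes, in arbitrary units.
candidates = [(10, 5), (8, 7), (12, 4), (9, 9), (10, 6)]
print(pareto_front(candidates))  # prints [(8, 7), (10, 5), (12, 4)]
```

The quadratic scan is fine for the handful of checkpoints or schedules one would compare by hand; larger sets would call for a sort-based sweep.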

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper flags topology-driven OOD failures in GNN schedulers for DAG workflows, but its claims rest on an abstract with no visible metrics or defined regimes.


The punchline is that this work applies GNN-based deep reinforcement learning to energy-aware scheduling of DAG workflows and identifies topology-driven out-of-distribution failures, but the supporting data is not visible in the abstract. What stands out as new is the focused look at how structural mismatches between training and deployment DAGs affect policy generalization in this scheduling context. Prior work has used GNNs for scheduling, but adding a targeted OOD analysis on topology is a reasonable extension that could matter for real systems where workflows vary.

The paper does well at laying out the setting clearly: heterogeneous resources, multiple objectives including energy, and the single-workflow case. It also gives a plausible mechanism (disrupted message passing) rather than just reporting that performance drops.

The soft spots are more significant. The abstract asserts controlled evaluations and a principled explanation, but supplies no quantitative metrics, no precise definition of the OOD regimes, and no mention of statistical controls or baselines. This makes it difficult to assess how general the failure mode is or whether the tested conditions reflect actual cloud distribution shifts like varying arrival patterns or partial information. The weakest assumption here is that the examined mismatches represent real deployment changes, and without the full results it is hard to judge if that holds.

This paper is for researchers interested in machine learning for cloud resource management and for those studying generalization in graph-based RL. A reader looking for practical insights on deploying learned schedulers would get value from the discussion of limitations, provided the experiments are fleshed out. It deserves a serious referee because the question is relevant to energy-sensitive infrastructure and the authors are honest about the challenges.

I would recommend sending it for review, with the expectation that the authors add the missing numbers, clarify the OOD setups, and perhaps test against non-GNN alternatives to strengthen the case.

Referee Report

2 major / 0 minor

Summary. The manuscript studies a GNN-based deep reinforcement learning scheduler for single-workflow, queue-free assignment of heterogeneous resources to DAGs, with the goal of minimizing completion time and energy. It identifies specific OOD regimes and claims, via controlled evaluations, that performance degradation arises from structural mismatches (e.g., DAG topology, resource heterogeneity) that disrupt GNN message passing and thereby impair policy generalization; the work concludes by calling for more robust representations.

Significance. If the empirical results and causal attribution are substantiated with quantitative evidence, the paper would usefully document a concrete limitation of current GNN schedulers under distribution shift and motivate targeted improvements in representation learning for cloud scheduling. The focus on message-passing disruption offers a mechanistic angle that could inform follow-on work, though the absence of reported metrics currently limits assessment of practical impact.

major comments (2)

[Abstract / OOD evaluation section] The central claim that 'performance degradation stems from structural mismatches ... which disrupt message passing' is asserted but unsupported by any quantitative metrics, explicit definitions of the OOD regimes (changes in DAG topology, workflow characteristics, etc.), or statistical controls; without these the explanation remains unverified.
[Discussion] Real-world relevance: the attribution of failures to message-passing disruption and the call for robust representations rests on the assumption that the tested OOD conditions are representative of production distribution shifts; the manuscript does not address whether dynamic arrivals, partial observability, or multi-tenant interference produce comparable structural mismatches, weakening the practical implications.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us improve the clarity and rigor of our work. We address each major comment below and indicate the revisions we will make to the manuscript.


Referee: [Abstract / OOD evaluation section] The central claim that 'performance degradation stems from structural mismatches ... which disrupt message passing' is asserted but unsupported by any quantitative metrics, explicit definitions of the OOD regimes (changes in DAG topology, workflow characteristics, etc.), or statistical controls; without these the explanation remains unverified.

Authors: We agree that the original manuscript lacks explicit definitions and quantitative support for the claim regarding message-passing disruption. To address this, we will revise the OOD evaluation section to include formal definitions of the OOD regimes based on specific changes in DAG topology (such as variations in node count, edge connectivity, and depth) and workflow characteristics. Additionally, we will introduce quantitative metrics, including measures of embedding dissimilarity and message aggregation variance across GNN layers, along with statistical analysis using multiple runs and confidence intervals to substantiate the performance degradation and its link to structural mismatches. These additions will provide the necessary evidence for the central claim.

Revision: yes

Referee: [Discussion] Real-world relevance: the attribution of failures to message-passing disruption and the call for robust representations rests on the assumption that the tested OOD conditions are representative of production distribution shifts; the manuscript does not address whether dynamic arrivals, partial observability, or multi-tenant interference produce comparable structural mismatches, weakening the practical implications.

Authors: We acknowledge the limitation in scope. Our study is confined to the single-workflow, queue-free setting, and the OOD conditions tested are structural shifts within this controlled environment. In the revised discussion section, we will explicitly discuss the assumptions underlying our evaluation and note that factors like dynamic job arrivals, partial observability, and multi-tenant interference could introduce different types of mismatches not examined here. We will clarify that while our findings highlight the importance of robust representations for topology shifts, extending to those more complex scenarios remains future work. This will better contextualize the practical implications without overstating generalizability.

Revision: yes
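The quantities named in the first response (embedding dissimilarity, per-run degradation, confidence intervals over multiple runs) can be sketched with the standard library alone. Everything below is an assumption about how such metrics might be computed, not the authors' analysis code: the function names are hypothetical, cosine dissimilarity stands in for whatever embedding measure the revision adopts, and the interval uses a normal approximation that is too narrow for a small number of runs.

```python
import math
from statistics import NormalDist, mean, stdev


def mean_ci(samples, confidence=0.95):
    """Mean and half-width of a normal-approximation confidence interval.

    Needs at least two samples; with few runs a Student-t interval would be wider.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = z * stdev(samples) / math.sqrt(len(samples))
    return mean(samples), half_width


def cosine_dissimilarity(a, b):
    """1 - cosine similarity between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def relative_degradation(in_dist, ood):
    """Per-run relative increase of a lower-is-better metric (makespan or energy)."""
    return [(o - i) / i for i, o in zip(in_dist, ood)]
```

In use, one would pass paired per-seed makespan or energy values to `relative_degradation`, summarize the result with `mean_ci`, and compare pooled graph embeddings from training and deployment DAGs with `cosine_dissimilarity`.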

Circularity Check

0 steps flagged

No circularity: empirical OOD evaluations are self-contained

full rationale

The paper's core contribution rests on controlled empirical experiments that measure scheduler performance under specific out-of-distribution DAG topology and environment shifts. These evaluations directly observe degradation and attribute it to message-passing disruption without any equations, fitted parameters, or predictions that reduce to the training inputs by construction. No self-definitional loops, renamed known results, or load-bearing self-citations appear in the derivation; the analysis is falsifiable via the reported test regimes and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated assumption that the chosen OOD test regimes are meaningful.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: performance degradation stems from structural mismatches between training and deployment environments, which disrupt message passing

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: DAG topology and host regime shape what deep RL schedulers actually learn to prioritize

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 1 internal anchor

[1] Karan Ajmera and T. K. Tewari. 2024. A systematic literature review on contemporary and future trends in virtual machine scheduling techniques in cloud and multi-access computing. Frontiers in Computer Science 6 (2024). doi:10.3389/fcomp.2024.1288552
[2] Hamid Arabnejad and Jorge G. Barbosa. 2014. List Scheduling Algorithm for Heterogeneous Systems by an Optimistic Cost Table. IEEE Transactions on Parallel and Distributed Systems 25, 3 (2014), 682–694
[3] Anton Beloglazov and Rajkumar Buyya. 2012. Optimal Online Deterministic Algorithms and Adaptive Heuristics for Energy and Performance Efficient Dynamic Consolidation of Virtual Machines in Cloud Data Centers. Concurrency and Computation (2012)
[4] Anton Beloglazov and Rajkumar Buyya. 2012. Optimal online deterministic algorithms and adaptive heuristics for energy and performance efficient dynamic consolidation of virtual machines in cloud data centers. Concurrency and Computation (2012)
[5] Yoshua Bengio and Yann LeCun. 2007. Scaling Learning Algorithms Towards AI. In Large Scale Kernel Machines. MIT Press
[6] Luiz Fernando Bittencourt and Edmundo Roberto Mauro Madeira. 2018. HCOC: A cost optimization algorithm for workflow scheduling in hybrid clouds. IEEE Transactions on Cloud Computing 6, 3 (2018), 649–662
[7] Shaileshh Bojja, Hongzi Mao, Malte Schwarzkopf, and Mohammad Alizadeh. 2019. Learning Graph-based Cluster Scheduling Algorithms. ACM SIGCOMM Computer Communication Review (2019)
[8] Rodrigo Calheiros et al. 2011. CloudSim: A toolkit for modeling and simulation of cloud computing environments. Software: Practice and Experience (2011)
[9] Sunera Chandrasiri and Dulani Meedeniya. 2025. Energy-efficient dynamic workflow scheduling in cloud environments using deep learning. Sensors 25, 5 (2025), 1428
[10] Eli Cortez, Anand Bonde, Alexandre Muzio, Mark Russinovich, Marcus Fontoura, and Ricardo Bianchini. 2017. Resource Central: Understanding and Predicting Workloads for Improved Resource Management in Large Cloud Platforms. In Proceedings of the 26th Symposium on Operating Systems Principles. 153–167
[11] Daniel G. da Silva and Luiz Fernando Bittencourt. 2017. Learning-based workflow scheduling in cloud computing: A survey. Journal of Cloud Computing 6, 1 (2017), 1–20
[12] Rafael Ferreira da Silva, Henri Casanova, Mats Rynge, Karan Vahi, Ewa Deelman, et al. 2020. A Community Framework for Enabling Scientific Workflow Research and Development. arXiv preprint arXiv:2009.00250 (2020). https://arxiv.org/abs/2009.00250
[13] Ewa Deelman, Dennis Gannon, Matthew Shields, and Ian Taylor. 2009. Workflows and e-Science: An overview of workflow system features and capabilities. Future Generation Computer Systems 25, 5 (2009), 528–540
[14] Ewa Deelman, Karan Vahi, Gideon Juve, Mats Rynge, Scott Callaghan, Philip J. Maechling, Rajiv Mayani, Weiwei Chen, Rafael Ferreira da Silva, Miron Livny, et al. 2015. Pegasus, a Workflow Management System for Science. Future Generation Computer Systems 46 (2015), 17–35. doi:10.1016/j.future.2014.10.008
[15] Juan J. Durillo and Radu Prodan. 2014. Multi-objective Workflow Scheduling in Amazon EC2. In Cluster Computing, Vol. 17. 169–189. [Authors from the Wiley survey]. 2023. A survey on energy-efficient workflow scheduling algorithms in cloud computing environment. Software: Practice and Experience (2023). doi:10.1002/spe.3292 ...
[16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep learning. Vol. 1. MIT Press
[17] Google. 2016. Machine Learning Applications for Data Center Optimization. Technical Report. Google. https://deepmind.google/discover/blog/deepmind-ai-reduces-google-data-centre-cooling-bill-by-40/ DeepMind AI Reduces Google Data Centre Cooling Bill by 40%
[18] R. L. Graham et al. 1979. Optimization and Approximation in Deterministic Sequencing and Scheduling: a Survey. Annals of Discrete Mathematics (1979)
[19] Zhaochen Gu, Sihai Tang, Beilei Jiang, Song Huang, Qiang Guan, and Song Fu. 2021. Characterizing Job-Task Dependency in Cloud Workloads Using Graph Learning. In 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). 288–297. doi:10.1109/IPDPSW52791.2021.00052
[20] Anas Hattay, Fred Ngole Mboula, Eric Gascard, and Zakaria Yahouni. 2024. Evaluating energy-aware cloud task scheduling techniques: A comprehensive dialectical approach. In 2024 IEEE/ACM 17th International Conference on Utility and Cloud Computing (UCC). IEEE, 109–118
[21] Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. 2006. A Fast Learning Algorithm for Deep Belief Nets. Neural Computation 18 (2006), 1527–1554
[22] Biao Hu, Xincheng Yang, and Mingguo Zhao. 2023. Online energy-efficient scheduling of DAG tasks on heterogeneous embedded platforms. Journal of Systems Architecture 140 (2023), 102894
[23] International Energy Agency. 2025. Data Centres and Data Transmission Networks. Technical Report. IEA. https://www.iea.org/energy-system/buildings/data-centres-and-data-transmission-networks
[24] Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In ICLR
[25] Vijay R. Konda and John N. Tsitsiklis. 1999. Actor-Critic Algorithms. In Advances in Neural Information Processing Systems, Vol. 12
[26] Hongzi Mao, Mohammad Alizadeh, Ishai Menache, and Srikanth Kandula. 2016. Resource management with deep reinforcement learning. In Proceedings of the 15th ACM Workshop on Hot Topics in Networks. 50–56
[27] Hongzi Mao, Malte Schwarzkopf, Shaileshh Bojja Venkatakrishnan, Zili Meng, and Mohammad Alizadeh. 2019. Learning Scheduling Algorithms for Data Processing Clusters. In ACM SIGCOMM. 270–288
[28] Mingwei Mao and Marty Humphrey. 2012. A survey of dynamic resource management in cloud computing. IEEE Communications Surveys & Tutorials 14, 4 (2012), 1101–1117
[29] Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous Methods for Deep Reinforcement Learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML, Vol. 48). 1928–1937
[30] Junyoung Park, Jaehyeong Chun, Sang Hun Kim, Youngkook Kim, and Jinkyoo Park. 2021. Learning to schedule job-shop problems: representation and policy learning using graph neural network and reinforcement learning. International journal of production research 59, 11 (2021), 3360–3377
[31] Michael Pinedo. 2012. Scheduling: Theory, Algorithms, and Systems. Springer
[32] Karl Popper. 2005. The logic of scientific discovery. Routledge
[33] Barry Rountree, David K. Lowenthal, Bronis R. de Supinski, Martin Schulz, Vincent W. Freeh, and Tyler Bletsch
[34] Adagio: Making DVS Practical for Complex HPC Applications. ACM International Conference on Supercomputing (2009), 460–469
[35] Rizos Sakellariou, Henan Zhao, and Ewa Deelman. 2010. Mapping workflows on grid resources: Experiments with the montage workflow. In Grids, P2P and services computing. Springer, 119–132
[36] John Schulman et al. 2016. High-Dimensional Continuous Control Using Generalized Advantage Estimation. In ICLR
[37] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347 (2017)
[38] Mohammad Shahrad, Rodrigo Fonseca, Íñigo Goiri, Gohar Chaudhry, Paul Batum, Jason Cooke, Eduardo Laureano, Colby Tresness, Mark Russinovich, and Ricardo Bianchini. 2020. Serverless in the Wild: Characterizing and Optimizing the Serverless Workload at a Large Cloud Provider. In 2020 USENIX Annual Technical Conference. 205–218
[39] S&P Global Commodity Insights. 2025. Global data center power demand to double by 2030 on AI. https://www.spglobal.com/energy/. Accessed: 2025-12-10
[40] Richard S Sutton, Andrew G Barto, et al. 1998. Reinforcement learning: An introduction. Vol. 1. MIT press Cambridge
[41] Ian J Taylor, Ewa Deelman, Dennis B Gannon, and Matthew Shields. 2014. Workflows for e-Science: Scientific Workflows for Grids. Springer
[42] Huangshi Tian, Yunchuan Zheng, and Wei Wang. 2019. Characterizing and Synthesizing Task Dependencies of Data-Parallel Jobs in Alibaba Cloud. In Proceedings of the ACM Symposium on Cloud Computing (SoCC). doi:10.1145/3357223.3362710
[43] Haluk Topcuoglu, Salim Hariri, and Min-You Wu. 2002. Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous Computing. IEEE Transactions on Parallel and Distributed Systems 13, 3 (2002), 260–274
[44] Laurens Versluis and Alexandru Iosup. 2020. A Survey and Annotated Bibliography of Workflow Scheduling in Computing Infrastructures: Community, Keyword, and Article Reviews. arXiv preprint arXiv:2004.10077 (2020)
[45] Qitian Wu, Hengrui Zhang, Junchi Yan, and David Wipf. 2022. Handling Distribution Shifts on Graphs: An Invariance Perspective. In International Conference on Learning Representations (ICLR)
[46] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2019. How Powerful are Graph Neural Networks? In International Conference on Learning Representations (ICLR)
[47] Kaixiang Zhang, Wenbo Li, Kai Zhang, Zenglin Li, and Zhaohui Qin. 2021. DeepJS: Job-shop scheduling with deep reinforcement learning. IEEE Transactions on Industrial Informatics 17, 3 (2021), 2083–2093
[48] Eckart Zitzler, Lothar Thiele, Marco Laumanns, Carlos M Fonseca, and Viviane Grunert da Fonseca. 2003. Performance assessment of multiobjective optimizers: An analysis and review. IEEE Transactions on Evolutionary Computation 7, 2 (2003), 117–132
[49] Zhiling Zong, Adam Manzanares, Xiaojun Ruan, and Xiao Qin. 2007. EAD and PEBD: Two Energy-Aware Duplication Scheduling Algorithms for Parallel Tasks on Homogeneous Clusters. 56, 12 (2007), 1661–1675

[1] [1]

Karan Ajmera and T. K. Tewari. 2024. A systematic literature review on contemporary and future trends in virtual machine scheduling techniques in cloud and multi-access computing.Frontiers in Computer Science6 (2024). doi:10.3389/fcomp.2024.1288552

work page doi:10.3389/fcomp.2024.1288552 2024

[2] [2]

Hamid Arabnejad and Jorge G. Barbosa. 2014. List Scheduling Algorithm for Heterogeneous Systems by an Optimistic Cost Table.IEEE Transactions on Parallel and Distributed Systems25, 3 (2014), 682–694

work page 2014

[3] [3]

Anton Beloglazov and Rajkumar Buyya. 2012. Optimal Online Deterministic Algorithms and Adaptive Heuristics for Energy and Performance Efficient Dynamic Consolidation of Virtual Machines in Cloud Data Centers.Concurrency and Computation(2012)

work page 2012

[4] [4]

Anton Beloglazov and Rajkumar Buyya. 2012. Optimal online deterministic algorithms and adaptive heuristics for energy and performance efficient dynamic consolidation of virtual machines in cloud data centers.Concurrency and Computation(2012)

work page 2012

[5] [5]

Yoshua Bengio and Yann LeCun. 2007. Scaling Learning Algorithms Towards AI. InLarge Scale Kernel Machines. MIT Press

work page 2007

[6] [6]

Luiz Fernando Bittencourt and Edmundo Roberto Mauro Madeira. 2018. HCOC: A cost optimization algorithm for workflow scheduling in hybrid clouds.IEEE Transactions on Cloud Computing6, 3 (2018), 649–662

work page 2018

[7] [7]

Shaileshh Bojja, Hongzi Mao, Malte Schwarzkopf, and Mohammad Alizadeh. 2019. Learning Graph-based Cluster Scheduling Algorithms.ACM SIGCOMM Computer Communication Review(2019)

work page 2019

[8] [8]

Rodrigo Calheiros et al. 2011. CloudSim: A toolkit for modeling and simulation of cloud computing environments. Software: Practice and Experience(2011)

work page 2011

[9] [9]

Sunera Chandrasiri and Dulani Meedeniya. 2025. Energy-efficient dynamic workflow scheduling in cloud environ- ments using deep learning.Sensors25, 5 (2025), 1428

work page 2025

[10] [10]

Eli Cortez, Anand Bonde, Alexandre Muzio, Mark Russinovich, Marcus Fontoura, and Ricardo Bianchini. 2017. Resource Central: Understanding and Predicting Workloads for Improved Resource Management in Large Cloud Platforms. InProceedings of the 26th Symposium on Operating Systems Principles. 153–167

work page 2017

[11] [11]

da Silva and Luiz Fernando Bittencourt

Daniel G. da Silva and Luiz Fernando Bittencourt. 2017. Learning-based workflow scheduling in cloud computing: A survey.Journal of Cloud Computing6, 1 (2017), 1–20

work page 2017

[12] [12]

Rafael Ferreira da Silva, Henri Casanova, Mats Rynge, Karan Vahi, Ewa Deelman, et al . 2020. A Community Framework for Enabling Scientific Workflow Research and Development.arXiv preprint arXiv:2009.00250(2020). https://arxiv.org/abs/2009.00250

work page arXiv 2020

[13] Ewa Deelman, Dennis Gannon, Matthew Shields, and Ian Taylor. 2009. Workflows and e-Science: An overview of workflow system features and capabilities. Future Generation Computer Systems 25, 5 (2009), 528–540.

[14] Ewa Deelman, Karan Vahi, Gideon Juve, Mats Rynge, Scott Callaghan, Philip J. Maechling, Rajiv Mayani, Weiwei Chen, Rafael Ferreira da Silva, Miron Livny, et al. 2015. Pegasus, a Workflow Management System for Science. Future Generation Computer Systems 46 (2015), 17–35. doi:10.1016/j.future.2014.10.008

[15] Juan J. Durillo and Radu Prodan. 2014. Multi-objective Workflow Scheduling in Amazon EC2. In Cluster Computing, Vol. 17. 169–189.

[Authors from the Wiley survey]. 2023. A survey on energy-efficient workflow scheduling algorithms in cloud computing environment. Software: Practice and Experience (2023). ... doi:10.1002/spe.3292

[16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. Vol. 1. MIT Press.

[17] Google. 2016. Machine Learning Applications for Data Center Optimization. Technical Report. Google. https://deepmind.google/discover/blog/deepmind-ai-reduces-google-data-centre-cooling-bill-by-40/ (DeepMind AI Reduces Google Data Centre Cooling Bill by 40%).

[18] R. L. Graham et al. 1979. Optimization and Approximation in Deterministic Sequencing and Scheduling: a Survey. Annals of Discrete Mathematics (1979).

[19] Zhaochen Gu, Sihai Tang, Beilei Jiang, Song Huang, Qiang Guan, and Song Fu. 2021. Characterizing Job-Task Dependency in Cloud Workloads Using Graph Learning. In 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). 288–297. doi:10.1109/IPDPSW52791.2021.00052

[20] Anas Hattay, Fred Ngole Mboula, Eric Gascard, and Zakaria Yahouni. 2024. Evaluating energy-aware cloud task scheduling techniques: A comprehensive dialectical approach. In 2024 IEEE/ACM 17th International Conference on Utility and Cloud Computing (UCC). IEEE, 109–118.

[21] Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. 2006. A Fast Learning Algorithm for Deep Belief Nets. Neural Computation 18 (2006), 1527–1554.

[22] Biao Hu, Xincheng Yang, and Mingguo Zhao. 2023. Online energy-efficient scheduling of DAG tasks on heterogeneous embedded platforms. Journal of Systems Architecture 140 (2023), 102894.

[23] International Energy Agency. 2025. Data Centres and Data Transmission Networks. Technical Report. IEA. https://www.iea.org/energy-system/buildings/data-centres-and-data-transmission-networks

[24] Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In ICLR.

[25] Vijay R. Konda and John N. Tsitsiklis. 1999. Actor-Critic Algorithms. In Advances in Neural Information Processing Systems, Vol. 12.

[26] Hongzi Mao, Mohammad Alizadeh, Ishai Menache, and Srikanth Kandula. 2016. Resource management with deep reinforcement learning. In Proceedings of the 15th ACM Workshop on Hot Topics in Networks. 50–56.

[27] Hongzi Mao, Malte Schwarzkopf, Shaileshh Bojja Venkatakrishnan, Zili Meng, and Mohammad Alizadeh. 2019. Learning Scheduling Algorithms for Data Processing Clusters. In ACM SIGCOMM. 270–288.

[28] Mingwei Mao and Marty Humphrey. 2012. A survey of dynamic resource management in cloud computing. IEEE Communications Surveys & Tutorials 14, 4 (2012), 1101–1117.

[29] Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous Methods for Deep Reinforcement Learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML, Vol. 48). 1928–1937.

[30] Junyoung Park, Jaehyeong Chun, Sang Hun Kim, Youngkook Kim, and Jinkyoo Park. 2021. Learning to schedule job-shop problems: representation and policy learning using graph neural network and reinforcement learning. International Journal of Production Research 59, 11 (2021), 3360–3377.

[31] Michael Pinedo. 2012. Scheduling: Theory, Algorithms, and Systems. Springer.

[32] Karl Popper. 2005. The Logic of Scientific Discovery. Routledge.

[33] Barry Rountree, David K. Lowenthal, Bronis R. de Supinski, Martin Schulz, Vincent W. Freeh, and Tyler Bletsch. 2009. Adagio: Making DVS Practical for Complex HPC Applications. ACM International Conference on Supercomputing (2009), 460–469.

[35] Rizos Sakellariou, Henan Zhao, and Ewa Deelman. 2010. Mapping workflows on grid resources: Experiments with the Montage workflow. In Grids, P2P and Services Computing. Springer, 119–132.

[36] John Schulman et al. 2016. High-Dimensional Continuous Control Using Generalized Advantage Estimation. In ICLR.

[37] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347 (2017).

[38] Mohammad Shahrad, Rodrigo Fonseca, Íñigo Goiri, Gohar Chaudhry, Paul Batum, Jason Cooke, Eduardo Laureano, Colby Tresness, Mark Russinovich, and Ricardo Bianchini. 2020. Serverless in the Wild: Characterizing and Optimizing the Serverless Workload at a Large Cloud Provider. In 2020 USENIX Annual Technical Conference. 205–218.

[39] S&P Global Commodity Insights. 2025. Global data center power demand to double by 2030 on AI. https://www.spglobal.com/energy/. Accessed: 2025-12-10.

[40] Richard S. Sutton and Andrew G. Barto. 1998. Reinforcement Learning: An Introduction. Vol. 1. MIT Press, Cambridge.

[41] Ian J. Taylor, Ewa Deelman, Dennis B. Gannon, and Matthew Shields. 2014. Workflows for e-Science: Scientific Workflows for Grids. Springer.

[42] Huangshi Tian, Yunchuan Zheng, and Wei Wang. 2019. Characterizing and Synthesizing Task Dependencies of Data-Parallel Jobs in Alibaba Cloud. In Proceedings of the ACM Symposium on Cloud Computing (SoCC). doi:10.1145/3357223.3362710

[43] Haluk Topcuoglu, Salim Hariri, and Min-You Wu. 2002. Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous Computing. IEEE Transactions on Parallel and Distributed Systems 13, 3 (2002), 260–274.

[44] Laurens Versluis and Alexandru Iosup. 2020. A Survey and Annotated Bibliography of Workflow Scheduling in Computing Infrastructures: Community, Keyword, and Article Reviews. arXiv preprint arXiv:2004.10077 (2020).

[45] Qitian Wu, Hengrui Zhang, Junchi Yan, and David Wipf. 2022. Handling Distribution Shifts on Graphs: An Invariance Perspective. In International Conference on Learning Representations (ICLR).

[46] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2019. How Powerful are Graph Neural Networks? In International Conference on Learning Representations (ICLR).

[47] Kaixiang Zhang, Wenbo Li, Kai Zhang, Zenglin Li, and Zhaohui Qin. 2021. DeepJS: Job-shop scheduling with deep reinforcement learning. IEEE Transactions on Industrial Informatics 17, 3 (2021), 2083–2093.

[48] Eckart Zitzler, Lothar Thiele, Marco Laumanns, Carlos M. Fonseca, and Viviane Grunert da Fonseca. 2003. Performance assessment of multiobjective optimizers: An analysis and review. IEEE Transactions on Evolutionary Computation 7, 2 (2003), 117–132.

[49] Zhiling Zong, Adam Manzanares, Xiaojun Ruan, and Xiao Qin. 2007. EAD and PEBD: Two Energy-Aware Duplication Scheduling Algorithms for Parallel Tasks on Homogeneous Clusters. 56, 12 (2007), 1661–1675.