arXiv: 2604.09202 · v1 · submitted 2026-04-10 · 💻 cs.LG · cs.AI

On the Role of DAG topology in Energy-Aware Cloud Scheduling: A GNN-Based Deep Reinforcement Learning Approach

Pith reviewed 2026-05-10 18:07 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords GNN · deep reinforcement learning · cloud scheduling · DAG topology · out-of-distribution · energy-aware scheduling · workflow scheduling · message passing

The pith

GNN-based schedulers for cloud DAG workflows degrade when training and deployment topologies differ

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies a graph neural network combined with deep reinforcement learning for assigning tasks in workflow DAGs to cloud resources while minimizing completion time and energy use. It focuses on a single-workflow setting without queues and tests the scheduler under conditions where the input graphs differ from those seen during training. The central result is that these models lose effectiveness because structural differences in the DAGs interfere with how the GNN passes messages between nodes, which in turn harms the learned scheduling policy. A sympathetic reader would care because real cloud workloads present many different workflow shapes, so knowing exactly when the AI scheduler stops working reliably informs how to make such systems more dependable.

Core claim

We identify specific out-of-distribution conditions under which GNN-based deep reinforcement learning schedulers fail and provide a principled explanation of why these failures occur. Through controlled OOD evaluations, we demonstrate that performance degradation stems from structural mismatches between training and deployment environments, which disrupt message passing and undermine policy generalization. Our analysis exposes fundamental limitations of current GNN-based schedulers and highlights the need for more robust representations to ensure reliable scheduling performance under distribution shifts.

What carries the argument

A GNN-based deep reinforcement learning scheduler that encodes DAG topology through message passing and selects task-to-resource assignments that balance completion time against energy consumption.
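
To make the message-passing mechanism concrete, here is a minimal sketch under stated assumptions; it is not the paper's model. It assumes a GIN-style sum aggregation (the Figure 5 caption describes the scheduler as GIN-based) with the learned MLP left out, scalar node features, and messages flowing from parent tasks to child tasks. The function name `message_passing`, the two toy graphs, and the unit features are hypothetical.

```python
from collections import defaultdict

def message_passing(edges, features, rounds=2, eps=0.0):
    """GIN-style sum aggregation with no learned weights (illustration only).

    edges: list of (parent, child) pairs; messages flow parent -> child.
    features: dict mapping node -> scalar feature.
    """
    parents = defaultdict(list)
    for u, v in edges:
        parents[v].append(u)
    h = dict(features)
    for _ in range(rounds):
        # Each node keeps (1 + eps) of its own state and adds its parents' states.
        h = {v: (1.0 + eps) * h[v] + sum(h[u] for u in parents[v]) for v in h}
    return h

# Wide DAG: one source fans out to three tasks that join in a sink.
wide = [("s", "a"), ("s", "b"), ("s", "c"), ("a", "t"), ("b", "t"), ("c", "t")]
# Chain DAG: the same five tasks arranged as a single dependency chain.
chain = [("s", "a"), ("a", "b"), ("b", "c"), ("c", "t")]
# Hypothetical identical features, so any difference comes from topology alone.
feats = {n: 1.0 for n in "sabct"}

h_wide = message_passing(wide, feats)
h_chain = message_passing(chain, feats)
print(h_wide["t"], h_chain["t"])  # 10.0 4.0
```

With identical per-node features, the sink task ends at 10.0 in the wide graph and 4.0 in the chain. A policy head trained on one range of embedding values would therefore see inputs of a different scale on the other topology, which is one plausible reading of the structural mismatch the paper describes.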

If this is right

  • Schedulers trained on one collection of DAG structures will exhibit degraded performance on workflows whose topologies differ.
  • Disruptions to message passing inside the graph neural network directly cause the drop in policy quality.
  • Current GNN-based approaches carry fundamental limitations when the input graphs change between training and use.
  • More robust graph representations are required before these schedulers can be relied upon under distribution shifts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Cloud operators could improve reliability by training on a broader range of DAG shapes or by adding explicit topology-variation techniques during learning.
  • Similar sensitivity to graph structure may appear in other graph-based optimization tasks such as network routing or job-shop scheduling.
  • Alternative architectures that reduce dependence on exact edge patterns, such as those using learned invariants or set-based processing, could be tested as direct follow-ups.
  • Long-term, schedulers might incorporate online detection of topology shifts and trigger adaptation steps when a mismatch is observed.
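
The online-detection idea in the last bullet could be prototyped roughly as follows. This is a sketch under assumptions, not something the paper proposes: it summarizes a DAG by its depth (number of longest-path levels) and width (size of the largest level), and flags a workflow whose width-to-depth ratio falls outside a range observed during training. The names `depth_and_width` and `is_topology_shift`, the ratio statistic, and the range `(0.8, 3.0)` are all hypothetical.

```python
from collections import defaultdict, deque

def depth_and_width(nodes, edges):
    """Return (depth, width): number of levels and size of the largest level."""
    children = defaultdict(list)
    indegree = {n: 0 for n in nodes}
    for u, v in edges:
        children[u].append(v)
        indegree[v] += 1
    level = {n: 0 for n in nodes}
    queue = deque(n for n in nodes if indegree[n] == 0)
    while queue:  # Kahn's algorithm; a node's level is final once it is dequeued
        u = queue.popleft()
        for v in children[u]:
            level[v] = max(level[v], level[u] + 1)
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    per_level = defaultdict(int)
    for n in nodes:
        per_level[level[n]] += 1
    return max(level.values()) + 1, max(per_level.values())

def is_topology_shift(nodes, edges, train_ratio_range):
    """Flag a DAG whose width/depth ratio lies outside the training range."""
    depth, width = depth_and_width(nodes, edges)
    low, high = train_ratio_range
    return not (low <= width / depth <= high)

nodes = ["s", "a", "b", "c", "t"]
wide = [("s", "a"), ("s", "b"), ("s", "c"), ("a", "t"), ("b", "t"), ("c", "t")]
chain = [("s", "a"), ("a", "b"), ("b", "c"), ("c", "t")]
train_range = (0.8, 3.0)  # hypothetical range seen when training on wide DAGs

print(is_topology_shift(nodes, wide, train_range))   # False
print(is_topology_shift(nodes, chain, train_range))  # True
```

A single ratio is a deliberately crude statistic; a real detector would likely track several structural descriptors and their distributions.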

Load-bearing premise

The controlled out-of-distribution conditions examined are representative of the distribution shifts that actually occur in real cloud scheduling deployments.

What would settle it

Deploy the trained GNN scheduler on real cloud workflow traces containing DAGs whose connectivity patterns, depths, or widths differ from the training set while keeping task counts and resource types comparable, then check whether completion time and energy metrics worsen substantially.
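
One simple way to quantify "worsen substantially" is the mean relative increase of a lower-is-better metric between in-distribution and out-of-distribution workflows. The sketch below assumes that definition; the function name `relative_degradation` is hypothetical and the numbers are placeholders for illustration, not results from the paper.

```python
def relative_degradation(in_dist, out_dist):
    """Mean relative increase of a lower-is-better metric, OOD vs. in-distribution."""
    mean_in = sum(in_dist) / len(in_dist)
    mean_out = sum(out_dist) / len(out_dist)
    return (mean_out - mean_in) / mean_in

# Placeholder makespans in seconds; NOT measurements from the paper.
makespan_in = [100.0, 110.0, 90.0]
makespan_ood = [130.0, 120.0, 140.0]

deg = relative_degradation(makespan_in, makespan_ood)
print(f"{deg:.0%}")  # 30%
```

The same function applies to energy; what counts as a substantial increase is a threshold the experimenter would have to fix in advance.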

Figures

Figures reproduced from arXiv: 2604.09202 by Anas Hattay, Eric Gascard, Fred Ngole Mboula, Zakaria Yahouni.

Figure 1: Overview of the workflow scheduling problem in a heterogeneous cloud. The scheduler receives a DAG with ready tasks and a set of machines with different speed and power. It must decide which task runs on which machine, and each choice changes both completion time and energy.
Figure 2: Schematic comparison of a wide DAG (left, shallow with many parallel branches) and a Long-CP DAG (right, deep dependency chain with limited side-branch concurrency).
Figure 3: Fitness landscape of the scheduling search space. We exhaustively enumerate all feasible action sequences for a small scenario and project them onto a 2D plane using a Hilbert curve. The Z-axis shows energy consumption (lower is better). Left: LongCP DAG. Right: wide DAG.
Figure 4: Speed–power host regimes. Illustration of the four queue-free host regimes studied in this paper. Homogeneous-Speed: all VMs share the same processing speed but differ in power, isolating energy trade-offs at fixed makespan. Homogeneous-Power: all VMs share the same active power but differ in speed, isolating time trade-offs. Heterogeneous Aligned: speed and power are monotonically aligned, inducing an int…
Figure 5: Overview of the proposed GIN-based actor–critic scheduler architecture.
Figure 6: Empirical attainment functions (EAFs) over 100 test jobs for the wide and LongCP specialists on …
Figure 7: Correlation between makespan and active energy across Pareto checkpoints for AL (top row) and …
Figure 8: Panels: a) Active energy, case NA; b) Active energy, case AL; c) Makespan, case NA; d) Makespan, …
Figure 9: The x-axis shows a parallelism index (normalized ready tasks per level / DAG width) and the y-axis shows an energy intensity index (aggregate active power rate while busy).

Original abstract

Cloud providers must assign heterogeneous compute resources to workflow DAGs while balancing competing objectives such as completion time, cost, and energy consumption. In this work, we study a single-workflow, queue-free scheduling setting and consider a graph neural network (GNN)-based deep reinforcement learning scheduler designed to minimize workflow completion time and energy usage. We identify specific out-of-distribution (OOD) conditions under which GNN-based deep reinforcement learning schedulers fail and provide a principled explanation of why these failures occur. Through controlled OOD evaluations, we demonstrate that performance degradation stems from structural mismatches between training and deployment environments, which disrupt message passing and undermine policy generalization. Our analysis exposes fundamental limitations of current GNN-based schedulers and highlights the need for more robust representations to ensure reliable scheduling performance under distribution shifts.
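
The trade-off between completion time and energy can be illustrated with a small evaluator for a fixed schedule. This is a sketch under simplifying assumptions that may differ from the paper's simulator: each machine runs one task at a time, a task's duration is its work divided by machine speed, only active energy (power times busy time) is counted, and data transfer is ignored. The function name `evaluate`, the three-task DAG, and the machine figures are hypothetical.

```python
def evaluate(tasks, edges, machines, assignment, order):
    """Return (makespan, active_energy) for a fixed task-to-machine assignment.

    tasks: dict task -> work units
    machines: dict machine -> (speed in work units per second, active power in W)
    assignment: dict task -> machine
    order: tasks listed in a topological order
    """
    parents = {t: [] for t in tasks}
    for u, v in edges:
        parents[v].append(u)
    finish = {}
    free_at = {m: 0.0 for m in machines}
    energy = 0.0
    for t in order:
        m = assignment[t]
        speed, power = machines[m]
        ready = max((finish[p] for p in parents[t]), default=0.0)
        start = max(ready, free_at[m])  # wait for parents and for the machine
        duration = tasks[t] / speed
        finish[t] = start + duration
        free_at[m] = finish[t]
        energy += power * duration      # joules of active energy
    return max(finish.values()), energy

# Hypothetical instance: tasks a and b both feed task c.
tasks = {"a": 4.0, "b": 4.0, "c": 4.0}
edges = [("a", "c"), ("b", "c")]
machines = {"fast": (2.0, 100.0), "slow": (1.0, 40.0)}
order = ["a", "b", "c"]

all_fast = evaluate(tasks, edges, machines, {"a": "fast", "b": "fast", "c": "fast"}, order)
mixed = evaluate(tasks, edges, machines, {"a": "fast", "b": "slow", "c": "fast"}, order)
all_slow = evaluate(tasks, edges, machines, {"a": "slow", "b": "slow", "c": "slow"}, order)
print(all_fast, mixed, all_slow)  # (6.0, 600.0) (6.0, 560.0) (12.0, 480.0)
```

In this toy instance the mixed assignment matches the all-fast makespan at lower energy because the two independent tasks run in parallel, while the all-slow assignment saves more energy at twice the makespan. Whether such wins exist depends on how much parallelism the DAG offers, which is where topology enters the scheduling decision.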

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript studies a GNN-based deep reinforcement learning scheduler for single-workflow, queue-free assignment of heterogeneous resources to DAGs, with the goal of minimizing completion time and energy. It identifies specific OOD regimes and claims, via controlled evaluations, that performance degradation arises from structural mismatches (e.g., DAG topology, resource heterogeneity) that disrupt GNN message passing and thereby impair policy generalization; the work concludes by calling for more robust representations.

Significance. If the empirical results and causal attribution are substantiated with quantitative evidence, the paper would usefully document a concrete limitation of current GNN schedulers under distribution shift and motivate targeted improvements in representation learning for cloud scheduling. The focus on message-passing disruption offers a mechanistic angle that could inform follow-on work, though the absence of reported metrics currently limits assessment of practical impact.

major comments (2)
  1. [Abstract / OOD evaluation section] Abstract and OOD evaluation section: the central claim that 'performance degradation stems from structural mismatches ... which disrupt message passing' is asserted but unsupported by any quantitative metrics, explicit definitions of the OOD regimes (changes in DAG topology, workflow characteristics, etc.), or statistical controls; without these the explanation remains unverified.
  2. [Discussion] Discussion of real-world relevance: the attribution of failures to message-passing disruption and the call for robust representations rests on the assumption that the tested OOD conditions are representative of production distribution shifts; the manuscript does not address whether dynamic arrivals, partial observability, or multi-tenant interference produce comparable structural mismatches, weakening the practical implications.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us improve the clarity and rigor of our work. We address each major comment below and indicate the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [Abstract / OOD evaluation section] Abstract and OOD evaluation section: the central claim that 'performance degradation stems from structural mismatches ... which disrupt message passing' is asserted but unsupported by any quantitative metrics, explicit definitions of the OOD regimes (changes in DAG topology, workflow characteristics, etc.), or statistical controls; without these the explanation remains unverified.

    Authors: We agree that the original manuscript lacks explicit definitions and quantitative support for the claim regarding message-passing disruption. To address this, we will revise the OOD evaluation section to include formal definitions of the OOD regimes based on specific changes in DAG topology (such as variations in node count, edge connectivity, and depth) and workflow characteristics. Additionally, we will introduce quantitative metrics, including measures of embedding dissimilarity and message aggregation variance across GNN layers, along with statistical analysis using multiple runs and confidence intervals to substantiate the performance degradation and its link to structural mismatches. These additions will provide the necessary evidence for the central claim. revision: yes

  2. Referee: [Discussion] Discussion of real-world relevance: the attribution of failures to message-passing disruption and the call for robust representations rests on the assumption that the tested OOD conditions are representative of production distribution shifts; the manuscript does not address whether dynamic arrivals, partial observability, or multi-tenant interference produce comparable structural mismatches, weakening the practical implications.

    Authors: We acknowledge the limitation in scope. Our study is confined to the single-workflow, queue-free setting, and the OOD conditions tested are structural shifts within this controlled environment. In the revised discussion section, we will explicitly discuss the assumptions underlying our evaluation and note that factors like dynamic job arrivals, partial observability, and multi-tenant interference could introduce different types of mismatches not examined here. We will clarify that while our findings highlight the importance of robust representations for topology shifts, extending to those more complex scenarios remains future work. This will better contextualize the practical implications without overstating generalizability. revision: yes
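
The metrics promised in the first response could take many forms. As one hedged illustration, the sketch below computes a cosine dissimilarity between two embedding vectors and a normal-approximation 95% confidence interval over repeated runs. Neither is confirmed as the authors' choice; the function names and sample values are hypothetical.

```python
import math
import statistics

def cosine_dissimilarity(x, y):
    """1 - cosine similarity; 0 for parallel vectors, 1 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm

def mean_confidence_interval(samples, z=1.96):
    """Normal-approximation 95% interval for the mean of per-run results."""
    mean = statistics.fmean(samples)
    half_width = z * statistics.stdev(samples) / math.sqrt(len(samples))
    return mean - half_width, mean + half_width

# Hypothetical mean node embeddings from in-distribution and OOD graphs.
print(cosine_dissimilarity([1.0, 0.0], [0.0, 1.0]))  # 1.0

# Hypothetical per-seed makespans in seconds; not results from the paper.
low, high = mean_confidence_interval([1.0, 2.0, 3.0])
```

With only a handful of seeds, a Student t quantile would be more appropriate than the fixed `z=1.96` assumed here.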

Circularity Check

0 steps flagged

No circularity: empirical OOD evaluations are self-contained

Full rationale

The paper's core contribution rests on controlled empirical experiments that measure scheduler performance under specific out-of-distribution DAG topology and environment shifts. These evaluations directly observe degradation and attribute it to message-passing disruption without any equations, fitted parameters, or predictions that reduce to the training inputs by construction. No self-definitional loops, renamed known results, or load-bearing self-citations appear in the derivation; the analysis is falsifiable via the reported test regimes and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated assumption that the chosen OOD test regimes are meaningful.
