pith. machine review for the scientific record.

arxiv: 2605.00209 · v1 · submitted 2026-04-30 · 💻 cs.DC

Recognition: unknown

Replication in Graph Partitioning and Scheduling Problems

Authors on Pith · no claims yet

Pith reviewed 2026-05-09 19:51 UTC · model grok-4.3

classification 💻 cs.DC
keywords graph partitioning · replication · DAG scheduling · hypergraph partitioning · parallel computing · communication reduction · integer linear programming · cost optimization

The pith

Permitting replication of nodes cuts total costs in hypergraph partitioning by 17-65 percent on average.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies what happens when nodes or tasks in graph partitioning and DAG scheduling may be assigned to more than one processor. It shows that this replication trades a small increase in duplicated work for often large drops in communication between processors. Experiments using exact integer linear programming on real-world graphs find that the combined cost falls by 17 to 65 percent for partitioning and by 11 to 23 percent for scheduling, with some cases needing no communication at all. The authors also prove that replication makes partitioning harder to approximate while leaving scheduling complexity largely unchanged. These results matter because the models guide how large computations are split across parallel machines, and the usual single-assignment restriction may be leaving measurable performance behind.

Core claim

For graph partitioning the authors prove that replication raises the approximation complexity of the problem, while for DAG scheduling the complexity impact is milder. On the experimental side they formulate integer linear programs that compute optimal costs with and without replication; on real graphs these show average cost reductions of 17-65 percent for hypergraph partitioning and 11-23 percent for DAG scheduling, sometimes eliminating all communication. A scalable heuristic extends the same comparison to larger scheduling instances.
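
To make the comparison concrete, below is a minimal sketch of the kind of ILP such experiments rely on, written in Python with PuLP on a four-node toy instance. The directed-edge cost model, weights, balance bound, and solver are illustrative assumptions rather than the authors' formulation; the point is only that forbidding or allowing multiple copies per node is a single constraint, and that allowing copies can lower the combined work-plus-communication objective.

```python
# Toy sketch (not the paper's exact ILP): a task assigned to processor p must
# receive each predecessor's output on p, either from a local copy (replication)
# or over the network. Weights, the balance cap, and the CBC solver are
# illustrative choices.
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, PULP_CBC_CMD, value

nodes = ["a", "b", "c", "d"]
edges = [("a", "c"), ("a", "d"), ("b", "c"), ("b", "d")]  # u -> v: v needs u's data
work = {"a": 1, "b": 1, "c": 2, "d": 2}
P = [0, 1]            # two processors
comm_weight = 3       # cost of shipping one datum across the network
balance_cap = 5       # max work per processor (hypothetical)

def optimal_cost(allow_replication):
    prob = LpProblem("partition", LpMinimize)
    x = LpVariable.dicts("x", (nodes, P), cat="Binary")               # copy of v on p
    y = LpVariable.dicts("y", (range(len(edges)), P), cat="Binary")   # u's data sent to p
    for v in nodes:                                   # every node gets at least one copy
        prob += lpSum(x[v][p] for p in P) >= 1
        if not allow_replication:                     # classic single-assignment restriction
            prob += lpSum(x[v][p] for p in P) == 1
    for p in P:                                       # work balance per processor
        prob += lpSum(work[v] * x[v][p] for v in nodes) <= balance_cap
    for i, (u, v) in enumerate(edges):                # comm needed if v sits on p without u
        for p in P:
            prob += y[i][p] >= x[v][p] - x[u][p]
    prob += (lpSum(work[v] * x[v][p] for v in nodes for p in P)
             + comm_weight * lpSum(y[i][p] for i in range(len(edges)) for p in P))
    prob.solve(PULP_CBC_CMD(msg=0))
    return value(prob.objective)

print("no replication :", optimal_cost(False))
print("replication    :", optimal_cost(True))
```

On this toy instance the single-assignment optimum pays 6 units of communication for a total cost of 12, while allowing replication duplicates the two cheap source nodes, removes all communication, and lowers the combined cost to 8.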

What carries the argument

Replication, the assignment of the same node or task to multiple processors so that incoming data need not cross the network.

If this is right

  • Optimal total cost drops when replication is allowed, sometimes to the point that communication volume reaches zero.
  • Mean cost reductions of 17-65 percent occur for hypergraph partitioning and 11-23 percent for DAG scheduling across the tested graphs.
  • Approximation hardness increases for partitioning problems once replication is introduced.
  • A heuristic solver can still deliver the same cost reductions on workloads too large for exact ILP methods (a minimal greedy sketch of this idea follows the list).
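
The paper's scheduling heuristic is considerably more elaborate (batch replication and related refinements; see Figure 3). As a minimal illustration of the decision such a heuristic automates, the greedy sketch below, under the same simplified cost model as the ILP sketch above, copies a source node onto another processor whenever the communication saved exceeds the duplicated work. All names and weights are illustrative.

```python
# Minimal greedy replication sketch (illustrative, not the paper's heuristic):
# start from a single-assignment solution and copy a node onto a processor
# whenever the communication it saves exceeds the extra work of the copy.
edges = [("a", "c"), ("a", "d"), ("b", "c"), ("b", "d")]   # u -> v: v needs u's data
work = {"a": 1, "b": 1, "c": 2, "d": 2}
assign = {"a": {0}, "b": {0}, "c": {0}, "d": {1}}          # initial single assignment
comm_weight = 3

def total_cost(assign):
    compute = sum(work[v] * len(ps) for v, ps in assign.items())
    comm = sum(comm_weight
               for u, v in edges
               for p in assign[v]
               if p not in assign[u])                      # u's data must be shipped to p
    return compute + comm

improved = True
while improved:
    improved = False
    for u, v in edges:
        for p in assign[v] - assign[u]:                    # v needs u on p, but u is absent
            saved = sum(comm_weight for s, t in edges if s == u and p in assign[t])
            if saved > work[u]:                            # replicate u onto p if it pays off
                assign[u] = assign[u] | {p}
                improved = True

print("cost after greedy replication:", total_cost(assign), assign)
```

On the toy instance this greedy pass duplicates the two source nodes and reaches the same cost of 8 that the ILP finds; on real workloads a heuristic of this kind trades the optimality guarantee for scalability.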

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Partitioning tools and schedulers used in production systems could be extended to consider replication as an explicit degree of freedom.
  • The same cost models might be applied to decide replication dynamically at runtime rather than only in a static planning phase.
  • Further experiments could test whether the observed gains persist when the underlying hardware has non-uniform communication costs.

Load-bearing premise

The chosen real-world graphs and the integer-linear-programming or heuristic cost models accurately reflect the trade-offs that appear in actual large-scale parallel executions.

What would settle it

Measure wall-clock runtime and actual network traffic on a real distributed system when the same workloads are executed from replicated versus non-replicated assignments produced by the ILP or heuristic.

Figures

Figures reproduced from arXiv: 2605.00209 by A. N. Yzelman, Pál András Papp, and Toni Böhnlein.

Figure 1: Illustration of hypergraph bipartitioning into two …
Figure 2: BSP schedule of a DAG on 2 processors and 3 supersteps. Node 𝑣 is originally computed in 𝑉1,2. Replicating 𝑣 in 𝑉2,3 allows the dashed communication to be removed. Note: both parents of 𝑣 are already present on processor 2 by 𝑉2,3.
Figure 3: Examples for each ingredient of the advanced heuristic. In batch replication (left), we remove the …
Figure 4: Distribution of cost reduction ratios on each dataset, …
Figure 5: Illustration of the cost reduction ratio with replication on the different datasets for …
read the original abstract

The efficient parallel execution of complex computations requires balancing the workload across processors while minimizing the communication between them. This inherent trade-off is often captured by graph partitioning or DAG scheduling problems. For the sake of model simplicity, most works on these problems assume that nodes can be assigned to only a single processor. However, in reality, replicating an operation on several processors can easily be beneficial: it may increase the computational costs only by a small amount, while significantly reducing the communication requirements. Our goal is to provide a comprehensive analysis of the impact of replication on partitioning and scheduling problems. On the theoretical side, we show that for graph partitioning, replication makes the problem significantly harder in terms of approximation complexity, whereas for scheduling, its impact on complexity seems less prominent. On the experimental side, we conduct a thorough analysis of the cost reduction obtainable by replication, on a wide range of graphs from real-world applications. For hypergraph partitioning, we use Integer Linear Programming (ILP) formulations to compare the optimal costs; our experiments show that replication can reduce the cost by 17%-65% on average, or even entirely remove the need for communication in some cases. For DAG scheduling, we similarly use ILPs on smaller graphs, and develop a sophisticated heuristic that is also applicable to much larger workloads. Our experiments here demonstrate a mean cost reduction of 11.61%-23.13% with replication, or even up to 58.17% in some cases.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript investigates the benefits of replication in graph partitioning and DAG scheduling problems for parallel execution. Theoretically, replication is shown to make graph partitioning significantly harder in terms of approximation complexity, while its impact on scheduling complexity is less pronounced. Experimentally, ILP is used to find optimal costs with and without replication on small instances, and a heuristic is developed for larger DAGs. Results indicate average cost reductions of 17%-65% for hypergraph partitioning, sometimes eliminating communication entirely, and mean reductions of 11.61%-23.13% for DAG scheduling, with peaks up to 58.17%, based on real-world application graphs.

Significance. Should the cost models prove representative of actual hardware behavior, this work would be significant for advancing optimization techniques in distributed computing by demonstrating how replication can substantially lower communication costs at modest computational expense. The theoretical complexity results provide context for why replication has been under-explored, and the experimental data on diverse graphs offers concrete evidence of potential gains.

major comments (2)
  1. The ILP formulations used to obtain the optimal costs with and without replication, which underpin the 17%-65% reduction claim, define replication cost as an additive term on computation while reducing communication; however, this may not account for additional overheads such as memory duplication, synchronization, and cache effects that arise in real executions, potentially making the reported benefits artifacts of the abstraction rather than reliable predictors.
  2. In the DAG scheduling section, the heuristic's results showing 11.61%-23.13% mean reductions lack details on graph selection criteria and statistical significance tests, which are necessary to substantiate the broader applicability of the findings beyond the tested instances.
minor comments (2)
  1. The abstract states 'or even entirely remove the need for communication in some cases' without specifying how often this occurs or on which graphs, which could be expanded for clarity.
  2. References to prior work on replication in scheduling could be more comprehensive to better position the contribution.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We address each major comment below and describe the planned revisions.

read point-by-point responses
  1. Referee: The ILP formulations used to obtain the optimal costs with and without replication, which underpin the 17%-65% reduction claim, define replication cost as an additive term on computation while reducing communication; however, this may not account for additional overheads such as memory duplication, synchronization, and cache effects that arise in real executions, potentially making the reported benefits artifacts of the abstraction rather than reliable predictors.

    Authors: We agree that the ILP cost model is an abstraction that treats replication cost as an additive computation term while crediting communication savings, without modeling memory duplication, synchronization, or cache effects. This is a deliberate simplification common to theoretical analyses of partitioning and scheduling, intended to isolate the communication-reduction potential of replication. We acknowledge that these omitted factors could reduce the practical benefits in real executions. In the revised manuscript we will add an explicit discussion of the model's limitations, including the potential impact of unmodeled overheads, and will outline directions for more hardware-aware cost models. revision: yes

  2. Referee: In the DAG scheduling section, the heuristic's results showing 11.61%-23.13% mean reductions lack details on graph selection criteria and statistical significance tests, which are necessary to substantiate the broader applicability of the findings beyond the tested instances.

    Authors: We thank the referee for this suggestion. The DAGs were selected from standard benchmark collections representing real-world applications; however, we agree that explicit selection criteria and statistical validation are needed. We will revise the manuscript to include a dedicated subsection describing the graph sources, sizes, and structural characteristics, together with the rationale for their inclusion. We will also report statistical significance tests (paired t-tests or Wilcoxon signed-rank tests) on the cost reductions to support the observed improvements. revision: yes
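
As a concrete illustration of the proposed validation, a minimal SciPy sketch of the two paired tests on hypothetical per-graph cost values follows; the numbers are placeholders, not the paper's data.

```python
# Hypothetical paired comparison of per-graph costs with and without replication.
# The values are placeholders for illustration only, not results from the paper.
from scipy import stats

cost_without = [100.0, 250.0, 80.0, 410.0, 132.0, 95.0]
cost_with    = [ 72.0, 190.0, 79.0, 305.0, 101.0, 60.0]

t_stat, t_p = stats.ttest_rel(cost_without, cost_with)   # paired t-test
w_stat, w_p = stats.wilcoxon(cost_without, cost_with)    # Wilcoxon signed-rank test

print(f"paired t-test: t={t_stat:.2f}, p={t_p:.4f}")
print(f"Wilcoxon:      W={w_stat:.2f}, p={w_p:.4f}")
```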

Circularity Check

0 steps flagged

No significant circularity; theoretical hardness results and ILP experiments are self-contained.

full rationale

The paper's theoretical claims consist of approximation-complexity statements (hardness results for partitioning with replication) that are proven from standard definitions rather than derived from fitted parameters or self-citations. Experimental sections solve ILP formulations and run heuristics on external real-world graphs to compute explicit cost differences between replication-allowed and baseline objectives; these are direct evaluations, not predictions that reduce to the input data by construction. No self-definitional loops, fitted-input renamings, or load-bearing self-citation chains appear in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper relies on standard graph-theoretic definitions and off-the-shelf ILP solvers without introducing new free parameters, ad-hoc axioms, or invented entities.

pith-pipeline@v0.9.0 · 5569 in / 1036 out tokens · 67139 ms · 2026-05-09T19:51:20.550673+00:00 · methodology

discussion (0)

