Serving Chain-structured Jobs with Large Memory Footprints with Application to Large Foundation Model Serving

I-Hong Hou; Ting He; Tingyang Sun

arxiv: 2604.14993 · v1 · submitted 2026-04-16 · 💻 cs.DC · cs.PF

Serving Chain-structured Jobs with Large Memory Footprints with Application to Large Foundation Model Serving

Tingyang Sun , Ting He , I-Hong Hou This is my paper

Pith reviewed 2026-05-10 10:07 UTC · model grok-4.3

classification 💻 cs.DC cs.PF

keywords server chain compositionblock placementcache allocationpipeline parallelismlarge language modelsNP-hard optimizationdistributed servingload balancing

0 comments

The pith

Formulating foundation model serving as chain-structured jobs with block placement and cache allocation reduces response times in distributed LLM systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines a new optimization problem of server chain composition for placing memory blocks and allocating cache when serving jobs that follow a chain structure and demand large memory. This formulation directly models the pipeline-parallel execution used in large foundation model serving. The authors establish that finding the optimal solution is NP-hard and then develop scalable algorithms that deliver performance guarantees when combined with standard load balancing. When these algorithms are applied to a distributed large language model serving system, measured response times drop compared with current methods. A reader would care because foundation models are increasingly central to AI services yet their memory demands make efficient distributed serving a practical bottleneck.

Core claim

The central claim is that the server chain composition problem, which jointly optimizes block placement and cache allocation for chain-structured jobs with large memory footprints, is NP-hard, yet admits scalable algorithms with provable guarantees under load balancing; applying these algorithms to pipeline-parallel large language model serving produces significantly lower response times than prior solutions.

What carries the argument

Server chain composition: the joint decision of which server hosts each memory block and how cache space is allocated along a pipeline of chain-structured jobs.

If this is right

The algorithms can be deployed in production pipeline-parallel serving systems without violating load-balancing assumptions.
Response-time improvements scale with the number of chain stages and memory blocks.
Guarantees remain valid for any chain-structured workload whose memory layout follows the same block-and-cache model.
The formulation separates placement decisions from runtime scheduling, allowing modular implementation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same placement-plus-allocation framing could be tested on other memory-heavy pipeline workloads such as video processing chains or scientific simulation pipelines.
Dynamic variants that re-optimize placement when new models are loaded would be a direct next step.
Hardware-aware extensions that incorporate GPU-to-GPU interconnect costs could tighten the performance bounds further.

Load-bearing premise

The abstraction of foundation model serving as chain-structured jobs whose performance is governed by block placement and cache allocation accurately reflects real system behavior and trade-offs.

What would settle it

Running the proposed algorithms on a real distributed LLM serving cluster and measuring no meaningful drop in end-to-end response times relative to existing schedulers would show the practical claim does not hold.

Figures

Figures reproduced from arXiv: 2604.14993 by I-Hong Hou, Ting He, Tingyang Sun.

**Figure 2.** Figure 2: Opportunity for further optimization after block placement with cache reservation: for [PITH_FULL_IMAGE:figures/full_fig_p011_2.png] view at source ↗

**Figure 3.** Figure 3: Evaluation of GBP-CR under (a) homogeneous settings and (b) heterogeneous settings ( [PITH_FULL_IMAGE:figures/full_fig_p017_3.png] view at source ↗

**Figure 4.** Figure 4: Evaluation of GCA under a fixed block placement produced by GBP-CR ( [PITH_FULL_IMAGE:figures/full_fig_p017_4.png] view at source ↗

**Figure 5.** Figure 5: Evaluation of JFFC under a fixed set of server chains produced by GBP-CR + GCA (for [PITH_FULL_IMAGE:figures/full_fig_p018_5.png] view at source ↗

**Figure 6.** Figure 6: Optimizing design parameter 𝑐 (marked by ★). much cache reservation, the lower bound from Theorem 3.7 yields a configuration 𝑐 ∗ that minimizes the mean response time by reserving the right amount of cache space for each placed block [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗

**Figure 7.** Figure 7: Optimal design parameter 𝑐 ∗ as a function of arrival rate 𝜆. blocks and dynamically route requests without explicitly composing server chains or allocating cache space ahead of time. The results in [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗

**Figure 8.** Figure 8: Overall performance comparison with state-of-the-art solutions (the case of [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗

**Figure 9.** Figure 9: Computation time profiling (by default, input length is 2,000 and output length is 20; MIG 3g [PITH_FULL_IMAGE:figures/full_fig_p021_9.png] view at source ↗

**Figure 10.** Figure 10: Communication time profiling (by default, input length is 2,000 and output length is 20; assume [PITH_FULL_IMAGE:figures/full_fig_p022_10.png] view at source ↗

**Figure 11.** Figure 11: Empirical CDF vs. exponential distribution CDF for the Azure LLM inference trace: (a) [PITH_FULL_IMAGE:figures/full_fig_p022_11.png] view at source ↗

**Figure 12.** Figure 12: Transition diagram of the system state under JFFC for [PITH_FULL_IMAGE:figures/full_fig_p029_12.png] view at source ↗

read the original abstract

As a current trend in Artificial Intelligence (AI), large foundation models are increasingly employed as the core of AI services. However, even after training, serving such models at scale remains a challenging task due to their heavy resource footprints, particularly in terms of GPU memory. While recent works revealed unique characteristics of systems serving foundation models that distinguish them from traditional distributed computing systems, there is still a lack of fundamental understanding of the underlying system management problems. This work aims at addressing this gap by extracting a novel problem of "server chain composition" via block placement and cache allocation for serving chainstructured jobs with large memory footprints, which models a fundamental problem in serving large foundation models through pipeline parallelism. After showing the NP-hardness of the optimal solution, the focus is turned to developing scalable algorithms with guaranteed performance under state-of-the-art load balancing. Application of the proposed solution to a distributed large language model (LLM) serving system shows significant reduction of response times compared to state-of-the-art solutions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper extracts a chain-composition scheduling problem from LLM pipeline parallelism, proves NP-hardness, and gives approximation algorithms with load-balancing guarantees, but the claimed response-time wins rest on how faithfully the block-placement model matches actual inference dynamics.

read the letter

The main takeaway is that this work isolates server chain composition—deciding block placement and cache allocation for chain-structured jobs with large memory footprints—as a distinct problem that arises in pipeline-parallel foundation model serving. They establish NP-hardness and then supply scalable algorithms that carry performance guarantees when standard load-balancing assumptions hold. That framing and the formal treatment are the clearest additions to the literature on distributed LLM inference systems.

Referee Report

2 major / 1 minor

Summary. The manuscript formulates the 'server chain composition' problem for serving chain-structured jobs with large memory footprints via block placement and cache allocation, modeling pipeline-parallel execution of large foundation models. It proves NP-hardness of the optimal solution, develops scalable algorithms with performance guarantees under load-balancing assumptions, and reports significant response-time reductions when the approach is applied to a distributed LLM serving system relative to prior solutions.

Significance. If the modeling assumptions hold, the work supplies both a theoretical foundation and practical algorithms for a core resource-management challenge in scaling foundation-model inference. The combination of hardness results, approximation guarantees, and end-to-end LLM evaluation would be a notable contribution to distributed systems for AI workloads.

major comments (2)

[Evaluation] The headline empirical claim of significant response-time reduction rests on mapping real LLM workloads onto the chain-structured job abstraction with fixed homogeneous memory blocks. No validation is provided that this abstraction faithfully reproduces pipeline dynamics, variable sequence lengths, attention KV-cache growth, or batching interference; without such evidence the observed gains cannot be confidently attributed to the proposed algorithms rather than to the modeling choices themselves.
[Algorithms and Analysis] The performance guarantees of the scalable algorithms are derived under 'state-of-the-art load balancing' assumptions. It is not shown whether these assumptions remain valid in the distributed LLM regime used for the experiments; if they do not, the claimed approximation ratios become inapplicable to the reported results.

minor comments (1)

[Abstract] The abstract and introduction should explicitly define or cite the precise load-balancing model used for the guarantees.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below and outline planned revisions to strengthen the manuscript.

read point-by-point responses

Referee: [Evaluation] The headline empirical claim of significant response-time reduction rests on mapping real LLM workloads onto the chain-structured job abstraction with fixed homogeneous memory blocks. No validation is provided that this abstraction faithfully reproduces pipeline dynamics, variable sequence lengths, attention KV-cache growth, or batching interference; without such evidence the observed gains cannot be confidently attributed to the proposed algorithms rather than to the modeling choices themselves.

Authors: We agree that explicit validation of the abstraction's fidelity is valuable. The chain-structured model abstracts pipeline-parallel LLM execution by representing model stages as chains and memory requirements as fixed homogeneous blocks, which aligns with standard pipeline configurations under fixed batch sizes. The end-to-end experiments apply this mapping to real LLM workloads in a distributed serving system. To address the concern, we will revise the evaluation section to add a dedicated discussion of modeling assumptions, including how the abstraction approximates KV-cache behavior and pipeline dynamics for representative workloads, supported by references to practical systems. We will also include additional analysis showing sensitivity to key parameters where data permits. revision: partial
Referee: [Algorithms and Analysis] The performance guarantees of the scalable algorithms are derived under 'state-of-the-art load balancing' assumptions. It is not shown whether these assumptions remain valid in the distributed LLM regime used for the experiments; if they do not, the claimed approximation ratios become inapplicable to the reported results.

Authors: The approximation guarantees rely on load-balancing assumptions that match techniques used in modern distributed systems. Our LLM experiments are conducted in a setup that applies comparable load balancing to distribute chain-structured jobs. We will revise the algorithms and analysis section to explicitly verify and document that these assumptions hold in the evaluated distributed LLM regime, for example by reporting load distribution metrics from the experiments, thereby confirming the relevance of the guarantees to the results. revision: partial

Circularity Check

0 steps flagged

No circularity detected; modeling, NP-hardness proof, and algorithmic guarantees are independent of results

full rationale

The paper introduces a new abstraction (server chain composition via block placement and cache allocation for chain-structured jobs), proves NP-hardness of the optimal solution, and develops scalable algorithms whose performance guarantees are stated to hold under external state-of-the-art load balancing assumptions. The LLM serving evaluation is presented as an empirical application showing response-time reduction versus prior solutions. No self-definitional equations, no parameters fitted to data then relabeled as predictions, and no load-bearing self-citations or uniqueness theorems imported from the authors' prior work are visible in the provided text. The modeling choice is offered as a contribution that enables the subsequent analysis rather than being derived from the performance claims it supports.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no mathematical derivations, parameter lists, or modeling assumptions are provided, so the ledger cannot be populated with specific entries.

pith-pipeline@v0.9.0 · 5473 in / 1018 out tokens · 51233 ms · 2026-05-10T10:07:06.064363+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages

[1]

Sprenger

A. Sprenger. Ray vllm inference.https://github.com/asprenger/ray_vllm_inference. Ac- cessed: 2025-09-01

work page 2025
[2]

A study comparing waiting times in global and local queuing systems with heterogeneous workers.Applied Sciences, 14(9), 2024

Inessa Ainbinder, Evgeni Temnikov, and Miriam Allalouf. A study comparing waiting times in global and local queuing systems with heterogeneous workers.Applied Sciences, 14(9), 2024

work page 2024
[3]

Amazon elastic kubernetes service.https://aws.amazon.com/eks/

Amazon Web Services. Amazon elastic kubernetes service.https://aws.amazon.com/eks/. Ac- cessed: 2025-09-01

work page 2025
[4]

Demystifying parallel and distributed deep learning: An in-depth concurrency analysis.ACM Comput

Tal Ben-Nun and Torsten Hoefler. Demystifying parallel and distributed deep learning: An in-depth concurrency analysis.ACM Comput. Surv., 52(4), Aug 2019

work page 2019
[5]

Asymptotic optimality of speed-aware jsq for hetero- geneous service systems.Performance Evaluation, 157-158:102320, 2022

Sanidhay Bhambay and Arpan Mukhopadhyay. Asymptotic optimality of speed-aware jsq for hetero- geneous service systems.Performance Evaluation, 157-158:102320, 2022

work page 2022
[6]

Distributed inference and fine-tuning of large language models over the internet

Alexander Borzunov, Max Ryabinin, Artem Chumachenko, et al. Distributed inference and fine-tuning of large language models over the internet. InAdvances in Neural Information Processing Systems, volume 36, pages 12312–12331, 2023

work page 2023
[7]

Placement and routing optimization problem for service function chain: State of art and future opportunities.CoRR, abs/1910.02613, 2019

Weihan Chen, Xia Yin, Zhiliang Wang, and Xingang Shi. Placement and routing optimization problem for service function chain: State of art and future opportunities.CoRR, abs/1910.02613, 2019

work page arXiv 1910
[8]

Nvidia multi-instance gpu user guide.https://docs.nvidia.com/ datacenter/tesla/mig-user-guide/index.html, 2025

NVIDIA Corporation. Nvidia multi-instance gpu user guide.https://docs.nvidia.com/ datacenter/tesla/mig-user-guide/index.html, 2025. 23

work page 2025
[9]

Gromicho, Jelke J

Joaquim A.S. Gromicho, Jelke J. van Hoorn, Francisco Saldanha da Gama, and Gerrit T. Timmer. Solving the job-shop scheduling problem optimally by dynamic programming.Computers & Operations Research, 39(12):2968–2977, 2012

work page 2012
[10]

Joint placement and routing of network function chains in data centers

Linqi Guo, John Pang, and Anwar Walid. Joint placement and routing of network function chains in data centers. InIEEE INFOCOM 2018 - IEEE Conference on Computer Communications, pages 612–620, 2018

work page 2018
[11]

Computing science: The easiest hard problem.American Scientist, 90(2):113–117, 2002

Brian Hayes. Computing science: The easiest hard problem.American Scientist, 90(2):113–117, 2002

work page 2002
[12]

Gpipe: Efficient training of giant neural networks using pipeline parallelism

Yanping Huang, Youlong Cheng, Ankur Bapna, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. InAdvances in Neural Information Processing Systems, volume 32, 2019

work page 2019
[13]

Branching-aware service function placement and routing in network function virtualization

Maryam Jalalitabar, Yang Wang, and Xiaojun Cao. Branching-aware service function placement and routing in network function virtualization. InIEEE Conference on Network Function Virtualization and Software Defined Networks (NFV-SDN), pages 1–6, 2019

work page 2019
[14]

Joint optimization of service function placement and flow distribution for service function chaining.IEEE Journal on Selected Areas in Communications, 35(11):2532–2541, 2017

Insun Jang, Dongeun Suh, Sangheon Pack, and Gy ¨orgy D´an. Joint optimization of service function placement and flow distribution for service function chaining.IEEE Journal on Selected Areas in Communications, 35(11):2532–2541, 2017

work page 2017
[15]

Springer Berlin Heidelberg, 2004

Hans Kellerer, Ulrich Pferschy, and David Pisinger.Multidimensional Knapsack Problems, pages 235–283. Springer Berlin Heidelberg, 2004

work page 2004
[16]

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolu- tional neural networks.Communications of the ACM, 60(6):84–90, May 2017

work page 2017
[17]

Larus, and Albert Greenberg

Yi Lu, Qiaomin Xie, Gabriel Kliot, Alan Geller, James R. Larus, and Albert Greenberg. Join- idle-queue: A novel load balancing algorithm for dynamically scalable web services.Performance Evaluation, 68(11):1056–1071, 2011

work page 2011
[18]

Traffic aware placement of interdependent nfv middleboxes

Wenrui Ma, Oscar Sandoval, Jonathan Beltran, Deng Pan, and Niki Pissinou. Traffic aware placement of interdependent nfv middleboxes. InIEEE Conference on Computer Communications (INFOCOM), pages 1–9, 2017

work page 2017
[19]

Medhat, Tarik Taleb, Asma Elmangoush, et al

Ahmed M. Medhat, Tarik Taleb, Asma Elmangoush, et al. Service function chaining in next generation networks: State of the art and research challenges.IEEE Communications Magazine, 55(2):216–223, 2017

work page 2017
[20]

Load balancing and density dependent jump markov processes

Michael Mitzenmacher. Load balancing and density dependent jump markov processes. InProceedings of the 37th Annual Symposium on Foundations of Computer Science, page 213, 1996

work page 1996
[21]

Queueing, predictions, and large language models: Chal- lenges and open problems.Stochastic Systems, 15(3):195–219, 2025

Michael Mitzenmacher and Rana Shahout. Queueing, predictions, and large language models: Chal- lenges and open problems.Stochastic Systems, 15(3):195–219, 2025

work page 2025
[22]

Pipedream: generalized pipeline paral- lelism for dnn training

Deepak Narayanan, Aaron Harlap, Amar Phanishayee, et al. Pipedream: generalized pipeline paral- lelism for dnn training. InProceedings of the 27th ACM Symposium on Operating Systems Principles, page 1–15, 2019

work page 2019
[23]

A datacenter scale distributed inference serving framework

NVIDIA. A datacenter scale distributed inference serving framework. https://github.com/ai-dynamo/dynamo. Accessed: 2025-09-01. 24

work page 2025
[24]

Splitwise: Efficient generative llm inference using phase splitting

Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, ´I˜nigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative llm inference using phase splitting. InProceedings of the 51st Annual International Symposium on Computer Architecture (ISCA), pages 123–136, Toronto, Canada,

work page
[25]

RIPE Atlas: A global internet measurement network.https://atlas.ripe.net/, 2024

RIPE NCC. RIPE Atlas: A global internet measurement network.https://atlas.ripe.net/, 2024

work page 2024
[26]

BLOOM: A 176B-parameter open-access multilingual language model, 2023

Teven Le Scao, Angela Fan, Christopher Akiki, et al. BLOOM: A 176B-parameter open-access multilingual language model, 2023

work page 2023
[27]

Network emulation using linux network namespaces

Daniel Schubert, Benedikt Jaeger, and Max Helm. Network emulation using linux network namespaces. Network, 57, 2019

work page 2019
[28]

Network congestion-aware online service function chain placement and load balancing

Xiaojun Shang, Zhenhua Liu, and Yuanyuan Yang. Network congestion-aware online service function chain placement and load balancing. InProceedings of the 48th International Conference on Parallel Processing, 2019

work page 2019
[29]

Optimizing resource allocation for geographically- distributed inference by large language models.Performance Evaluation, 170:102527, 2025

Tingyang Sun, Ting He, Bo Ji, and Parimal Parag. Optimizing resource allocation for geographically- distributed inference by large language models.Performance Evaluation, 170:102527, 2025

work page 2025
[30]

Communication-

Zhenheng Tang, Shaohuai Shi, Xiaowen Chu, Wei Wang, and Bo Li. Communication-efficient dis- tributed deep learning: A comprehensive survey.CoRR, abs/2003.06307, 2020

work page arXiv 2003
[31]

vllm: Distributed inference and serving

vLLM Project. vllm: Distributed inference and serving. https://docs.vllm.ai/en/stable/index.html. Accessed: 2025-09-01

work page 2025
[32]

Distributed join-the-idle-queue for low latency cloud services.IEEE/ACM Transactions on Networking, 26(5):2309–2319, 2018

Chunpu Wang, Chen Feng, and Julian Cheng. Distributed join-the-idle-queue for low latency cloud services.IEEE/ACM Transactions on Networking, 26(5):2309–2319, 2018

work page 2018
[33]

Wentao Weng, Xingyu Zhou, and R. Srikant. Optimal load balancing with locality constraints.Proc. ACM Meas. Anal. Comput. Syst., 4(3), 2020

work page 2020
[34]

Optimality of the shortest line discipline.Journal of Applied Probability, 14(1):181– 189, 1977

Wayne Winston. Optimality of the shortest line discipline.Journal of Applied Probability, 14(1):181– 189, 1977

work page 1977
[35]

Pipemare: Asynchronous pipeline parallel dnn training

Bowen Yang, Jian Zhang, Jonathan Li, et al. Pipemare: Asynchronous pipeline parallel dnn training. InProceedings of Machine Learning and Systems, pages 269–296, 2021

work page 2021
[36]

is there a feasible solution to (29) with objective value≥𝜇?

Qixia Zhang, Yikai Xiao, Fangming Liu, John C.S. Lui, Jian Guo, and Tao Wang. Joint optimization of chain placement and request scheduling for network function virtualization. InIEEE 37th International Conference on Distributed Computing Systems (ICDCS), pages 731–741, 2017. 25 Notation Description 𝐽 number of servers J,J + set/extended set of servers 𝑀 𝑗...

work page 2017

[1] [1]

Sprenger

A. Sprenger. Ray vllm inference.https://github.com/asprenger/ray_vllm_inference. Ac- cessed: 2025-09-01

work page 2025

[2] [2]

A study comparing waiting times in global and local queuing systems with heterogeneous workers.Applied Sciences, 14(9), 2024

Inessa Ainbinder, Evgeni Temnikov, and Miriam Allalouf. A study comparing waiting times in global and local queuing systems with heterogeneous workers.Applied Sciences, 14(9), 2024

work page 2024

[3] [3]

Amazon elastic kubernetes service.https://aws.amazon.com/eks/

Amazon Web Services. Amazon elastic kubernetes service.https://aws.amazon.com/eks/. Ac- cessed: 2025-09-01

work page 2025

[4] [4]

Demystifying parallel and distributed deep learning: An in-depth concurrency analysis.ACM Comput

Tal Ben-Nun and Torsten Hoefler. Demystifying parallel and distributed deep learning: An in-depth concurrency analysis.ACM Comput. Surv., 52(4), Aug 2019

work page 2019

[5] [5]

Asymptotic optimality of speed-aware jsq for hetero- geneous service systems.Performance Evaluation, 157-158:102320, 2022

Sanidhay Bhambay and Arpan Mukhopadhyay. Asymptotic optimality of speed-aware jsq for hetero- geneous service systems.Performance Evaluation, 157-158:102320, 2022

work page 2022

[6] [6]

Distributed inference and fine-tuning of large language models over the internet

Alexander Borzunov, Max Ryabinin, Artem Chumachenko, et al. Distributed inference and fine-tuning of large language models over the internet. InAdvances in Neural Information Processing Systems, volume 36, pages 12312–12331, 2023

work page 2023

[7] [7]

Placement and routing optimization problem for service function chain: State of art and future opportunities.CoRR, abs/1910.02613, 2019

Weihan Chen, Xia Yin, Zhiliang Wang, and Xingang Shi. Placement and routing optimization problem for service function chain: State of art and future opportunities.CoRR, abs/1910.02613, 2019

work page arXiv 1910

[8] [8]

Nvidia multi-instance gpu user guide.https://docs.nvidia.com/ datacenter/tesla/mig-user-guide/index.html, 2025

NVIDIA Corporation. Nvidia multi-instance gpu user guide.https://docs.nvidia.com/ datacenter/tesla/mig-user-guide/index.html, 2025. 23

work page 2025

[9] [9]

Gromicho, Jelke J

Joaquim A.S. Gromicho, Jelke J. van Hoorn, Francisco Saldanha da Gama, and Gerrit T. Timmer. Solving the job-shop scheduling problem optimally by dynamic programming.Computers & Operations Research, 39(12):2968–2977, 2012

work page 2012

[10] [10]

Joint placement and routing of network function chains in data centers

Linqi Guo, John Pang, and Anwar Walid. Joint placement and routing of network function chains in data centers. InIEEE INFOCOM 2018 - IEEE Conference on Computer Communications, pages 612–620, 2018

work page 2018

[11] [11]

Computing science: The easiest hard problem.American Scientist, 90(2):113–117, 2002

Brian Hayes. Computing science: The easiest hard problem.American Scientist, 90(2):113–117, 2002

work page 2002

[12] [12]

Gpipe: Efficient training of giant neural networks using pipeline parallelism

Yanping Huang, Youlong Cheng, Ankur Bapna, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. InAdvances in Neural Information Processing Systems, volume 32, 2019

work page 2019

[13] [13]

Branching-aware service function placement and routing in network function virtualization

Maryam Jalalitabar, Yang Wang, and Xiaojun Cao. Branching-aware service function placement and routing in network function virtualization. InIEEE Conference on Network Function Virtualization and Software Defined Networks (NFV-SDN), pages 1–6, 2019

work page 2019

[14] [14]

Joint optimization of service function placement and flow distribution for service function chaining.IEEE Journal on Selected Areas in Communications, 35(11):2532–2541, 2017

Insun Jang, Dongeun Suh, Sangheon Pack, and Gy ¨orgy D´an. Joint optimization of service function placement and flow distribution for service function chaining.IEEE Journal on Selected Areas in Communications, 35(11):2532–2541, 2017

work page 2017

[15] [15]

Springer Berlin Heidelberg, 2004

Hans Kellerer, Ulrich Pferschy, and David Pisinger.Multidimensional Knapsack Problems, pages 235–283. Springer Berlin Heidelberg, 2004

work page 2004

[16] [16]

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolu- tional neural networks.Communications of the ACM, 60(6):84–90, May 2017

work page 2017

[17] [17]

Larus, and Albert Greenberg

Yi Lu, Qiaomin Xie, Gabriel Kliot, Alan Geller, James R. Larus, and Albert Greenberg. Join- idle-queue: A novel load balancing algorithm for dynamically scalable web services.Performance Evaluation, 68(11):1056–1071, 2011

work page 2011

[18] [18]

Traffic aware placement of interdependent nfv middleboxes

Wenrui Ma, Oscar Sandoval, Jonathan Beltran, Deng Pan, and Niki Pissinou. Traffic aware placement of interdependent nfv middleboxes. InIEEE Conference on Computer Communications (INFOCOM), pages 1–9, 2017

work page 2017

[19] [19]

Medhat, Tarik Taleb, Asma Elmangoush, et al

Ahmed M. Medhat, Tarik Taleb, Asma Elmangoush, et al. Service function chaining in next generation networks: State of the art and research challenges.IEEE Communications Magazine, 55(2):216–223, 2017

work page 2017

[20] [20]

Load balancing and density dependent jump markov processes

Michael Mitzenmacher. Load balancing and density dependent jump markov processes. InProceedings of the 37th Annual Symposium on Foundations of Computer Science, page 213, 1996

work page 1996

[21] [21]

Queueing, predictions, and large language models: Chal- lenges and open problems.Stochastic Systems, 15(3):195–219, 2025

Michael Mitzenmacher and Rana Shahout. Queueing, predictions, and large language models: Chal- lenges and open problems.Stochastic Systems, 15(3):195–219, 2025

work page 2025

[22] [22]

Pipedream: generalized pipeline paral- lelism for dnn training

Deepak Narayanan, Aaron Harlap, Amar Phanishayee, et al. Pipedream: generalized pipeline paral- lelism for dnn training. InProceedings of the 27th ACM Symposium on Operating Systems Principles, page 1–15, 2019

work page 2019

[23] [23]

A datacenter scale distributed inference serving framework

NVIDIA. A datacenter scale distributed inference serving framework. https://github.com/ai-dynamo/dynamo. Accessed: 2025-09-01. 24

work page 2025

[24] [24]

Splitwise: Efficient generative llm inference using phase splitting

Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, ´I˜nigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative llm inference using phase splitting. InProceedings of the 51st Annual International Symposium on Computer Architecture (ISCA), pages 123–136, Toronto, Canada,

work page

[25] [25]

RIPE Atlas: A global internet measurement network.https://atlas.ripe.net/, 2024

RIPE NCC. RIPE Atlas: A global internet measurement network.https://atlas.ripe.net/, 2024

work page 2024

[26] [26]

BLOOM: A 176B-parameter open-access multilingual language model, 2023

Teven Le Scao, Angela Fan, Christopher Akiki, et al. BLOOM: A 176B-parameter open-access multilingual language model, 2023

work page 2023

[27] [27]

Network emulation using linux network namespaces

Daniel Schubert, Benedikt Jaeger, and Max Helm. Network emulation using linux network namespaces. Network, 57, 2019

work page 2019

[28] [28]

Network congestion-aware online service function chain placement and load balancing

Xiaojun Shang, Zhenhua Liu, and Yuanyuan Yang. Network congestion-aware online service function chain placement and load balancing. InProceedings of the 48th International Conference on Parallel Processing, 2019

work page 2019

[29] [29]

Optimizing resource allocation for geographically- distributed inference by large language models.Performance Evaluation, 170:102527, 2025

Tingyang Sun, Ting He, Bo Ji, and Parimal Parag. Optimizing resource allocation for geographically- distributed inference by large language models.Performance Evaluation, 170:102527, 2025

work page 2025

[30] [30]

Communication-

Zhenheng Tang, Shaohuai Shi, Xiaowen Chu, Wei Wang, and Bo Li. Communication-efficient dis- tributed deep learning: A comprehensive survey.CoRR, abs/2003.06307, 2020

work page arXiv 2003

[31] [31]

vllm: Distributed inference and serving

vLLM Project. vllm: Distributed inference and serving. https://docs.vllm.ai/en/stable/index.html. Accessed: 2025-09-01

work page 2025

[32] [32]

Distributed join-the-idle-queue for low latency cloud services.IEEE/ACM Transactions on Networking, 26(5):2309–2319, 2018

Chunpu Wang, Chen Feng, and Julian Cheng. Distributed join-the-idle-queue for low latency cloud services.IEEE/ACM Transactions on Networking, 26(5):2309–2319, 2018

work page 2018

[33] [33]

Wentao Weng, Xingyu Zhou, and R. Srikant. Optimal load balancing with locality constraints.Proc. ACM Meas. Anal. Comput. Syst., 4(3), 2020

work page 2020

[34] [34]

Optimality of the shortest line discipline.Journal of Applied Probability, 14(1):181– 189, 1977

Wayne Winston. Optimality of the shortest line discipline.Journal of Applied Probability, 14(1):181– 189, 1977

work page 1977

[35] [35]

Pipemare: Asynchronous pipeline parallel dnn training

Bowen Yang, Jian Zhang, Jonathan Li, et al. Pipemare: Asynchronous pipeline parallel dnn training. InProceedings of Machine Learning and Systems, pages 269–296, 2021

work page 2021

[36] [36]

is there a feasible solution to (29) with objective value≥𝜇?

Qixia Zhang, Yikai Xiao, Fangming Liu, John C.S. Lui, Jian Guo, and Tao Wang. Joint optimization of chain placement and request scheduling for network function virtualization. InIEEE 37th International Conference on Distributed Computing Systems (ICDCS), pages 731–741, 2017. 25 Notation Description 𝐽 number of servers J,J + set/extended set of servers 𝑀 𝑗...

work page 2017