
arxiv: 2512.22142 · v2 · submitted 2025-12-13 · 💻 cs.DC · cs.LG

On Harnessing Idle Compute at the Edge for Foundation Model Training

Pith reviewed 2026-05-16 22:26 UTC · model grok-4.3

classification 💻 cs.DC cs.LG
keywords edge computing, foundation model training, GEMM operations, parameter server, device heterogeneity, distributed training, fault tolerance

The pith

Cleave trains foundation models on edge devices by exploiting GEMM's asymmetric I/O pattern to reach cloud-comparable speeds while scaling to thousands of heterogeneous nodes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Cleave as a system that harnesses spare compute on edge devices for foundation model training, addressing the centralization of current approaches. It builds on the insight that GEMM operations send large inputs over downlink and small outputs over uplink, which aligns with the 2-10x bandwidth asymmetry common in edge networks. A parameter-server architecture combined with decomposition into independent sub-GEMM tasks reduces per-device communication as scale increases and provides a single mechanism for memory limits, overhead control, and recovery under device churn. Evaluation shows this delivers cloud-like per-batch times and outperforms prior edge methods by 4-10x at equivalent device counts.
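To make the asymmetry concrete, here is a small worked calculation under an assumed partitioning that the summary above does not specify: the output of a GEMM $C = AB$ is cut into a $p \times q$ grid of tiles, one tile per device. The symbols $m, k, n, p, q, d$ and the tiling itself are illustrative assumptions, not the paper's stated scheme.

```latex
% Assumption: C = A B with A \in \mathbb{R}^{m \times k}, B \in \mathbb{R}^{k \times n};
% the output is split into a p x q grid of tiles, d = p q devices, one tile each.
% Volumes are counted in matrix elements per device.
\begin{align*}
V_{\text{down}} &= k\left(\frac{m}{p} + \frac{n}{q}\right)
  && \text{(row block of } A \text{ plus column block of } B\text{)} \\
V_{\text{up}} &= \frac{m}{p}\cdot\frac{n}{q}
  && \text{(one output tile)} \\
\frac{V_{\text{down}}}{V_{\text{up}}} &= k\left(\frac{q}{n} + \frac{p}{m}\right)
\end{align*}
% Square case: m = n and p = q = \sqrt{d}.
\begin{align*}
V_{\text{down}} = \frac{2km}{\sqrt{d}}, \qquad
V_{\text{up}} = \frac{m^{2}}{d}, \qquad
\frac{V_{\text{down}}}{V_{\text{up}}} = \frac{2k\sqrt{d}}{m}
\end{align*}
```

Under these assumptions both per-device volumes shrink as $d$ grows, and downlink traffic exceeds uplink traffic whenever $2k\sqrt{d} > m$. For $m = k = 4096$ and $d = 256$ the ratio is 32.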

Core claim

Cleave achieves cloud-comparable GPU training performance by aligning GEMM operations with edge network bandwidth asymmetries in a parameter-server architecture, allowing per-device communication to decrease with scale, and scales to thousands of heterogeneous devices with at least 100x faster failure recovery than prior systems.

What carries the argument

Parameter-server-centric architecture that decomposes training into independent sub-GEMM tasks to unify memory constraints, communication overhead, and fault tolerance under device churn.
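The recovery argument can be sketched in a few lines. The example below is a minimal illustration, assuming each sub-GEMM task is one output tile that depends only on a row block of A and a column block of B. The function name `tiled_gemm_with_churn` and the `fail_once` parameter are hypothetical; this is not Cleave's scheduler, only a toy model of why independent tasks make recovery cheap.

```python
import numpy as np
from collections import deque


def tiled_gemm_with_churn(A, B, tile_rows, tile_cols, fail_once):
    """Compute A @ B as independent output tiles, re-queuing lost tiles.

    fail_once is a set of (row_offset, col_offset) tiles whose first attempt
    is lost, a hypothetical stand-in for a device leaving mid-task.
    Returns the product and the total number of task attempts.
    """
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"

    C = np.zeros((m, n), dtype=np.result_type(A, B))
    # The server-side queue holds one task per output tile.
    tasks = deque(
        (i, j) for i in range(0, m, tile_rows) for j in range(0, n, tile_cols)
    )
    pending_failures = set(fail_once)
    attempts = 0

    while tasks:
        i, j = tasks.popleft()
        attempts += 1
        if (i, j) in pending_failures:
            # The device vanished: only this tile is re-queued, and no other
            # tile has to be recomputed or rolled back.
            pending_failures.discard((i, j))
            tasks.append((i, j))
            continue
        # A tile needs only its own row block of A and column block of B.
        C[i:i + tile_rows, j:j + tile_cols] = (
            A[i:i + tile_rows, :] @ B[:, j:j + tile_cols]
        )
    return C, attempts


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((6, 5))
    B = rng.standard_normal((5, 4))
    C, attempts = tiled_gemm_with_churn(A, B, 2, 2, {(0, 0), (4, 2)})
    print(np.allclose(C, A @ B), attempts)
```

Because no tile depends on another tile's result, the cost of a failure in this toy model is one extra attempt per lost tile. Whether the same holds on real devices depends on overheads the sketch ignores, such as re-sending inputs and detecting the failure.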

Load-bearing premise

The asymmetric I/O pattern of GEMM operations can be exploited at scale on real edge networks without hidden overheads from memory fragmentation, synchronization, or network variability that would erase the claimed speedups.

What would settle it

Deploy Cleave on a large real-world testbed of heterogeneous edge devices with measured variable network conditions and check whether the 4-10x runtime gains and 100x faster recovery times hold compared to baselines.

Figures

Figures reproduced from arXiv: 2512.22142 by Amos Storkey, Leyang Xue, Mahesh K. Marina, Meghana Madhyastha, Myungjin Lee, Randal Burns.

Figure 1: The per-device communication volume when train...
Figure 2: The workflow of Cleave from model defined in training script to DAG of GEMMs. Edges in the DAG represent the memory dependency. Each GEMM is scheduled selectively across devices with best effort communication and computation overlap.
Figure 3: Normalized training latency for a batch (lower the...
Figure 4: Normalized training latency for a batch with OPT...
Figure 6: Latency performance under increasing stragglers,...
Figure 8: Batch runtime of OPT-13B when scaling up the...
Figure 9: Batch runtime when scaling up model size propor...
Figure 10: Batch runtime of OPT-13B when scaling up batch...
Original abstract

The foundation-model ecosystem remains highly centralized because training requires immense compute resources and is therefore largely limited to large cloud operators. Edge-assisted foundation model training that harnesses spare compute on edge devices offers a more democratized alternative. However, existing edge-training approaches fall short: they struggle to match cloud-training performance, scale to larger models, fit within device memory limits, or keep communication overhead manageable. They also do not handle device heterogeneity and churn satisfactorily. We introduce Cleave, built on a structural insight: each GEMM has an asymmetric I/O pattern (its input matrices, sent over downlink, are much larger than the partial output blocks returned over uplink), matching edge networks where downlink bandwidth exceeds uplink by 2-10x. Exploiting this alignment with a parameter-server-centric architecture, Cleave makes per-device communication decrease as more devices join, rather than stay constant as in conventional TP. Decomposing training into independent sub-GEMM tasks yields one scheduling abstraction that unifies memory constraints, communication overhead, and fault tolerance under device churn. Our evaluation shows that Cleave achieves cloud-comparable GPU training performance and outperforms state-of-the-art edge-training methods by 4-10x in per-batch runtime at the same device counts. Beyond this shared operating range, Cleave scales to thousands of heterogeneous devices (a regime where prior edge-training systems cannot operate) and achieves at least 100x faster recovery from device failures.
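A back-of-the-envelope estimate shows how the GEMM traffic pattern lines up with an asymmetric link. The sketch below assumes one output tile per device on a p x q grid, 2-byte elements, and a 100 Mbit/s downlink with a 20 Mbit/s uplink (a 5x asymmetry, inside the 2-10x range the abstract cites). The function name, the tiling, and the link speeds are illustrative assumptions, not measurements from the paper.

```python
def tile_transfer_seconds(m, k, n, p, q, down_mbps, up_mbps, bytes_per_elem=2):
    """Estimate per-device traffic and transfer time for one output tile.

    Assumes C = A @ B with A of shape (m, k) and B of shape (k, n), split into
    a p x q grid of output tiles. Latency, protocol overhead, and overlap with
    compute are ignored.
    """
    rows, cols = m / p, n / q
    # Downlink: a row block of A (rows x k) plus a column block of B (k x cols).
    down_bytes = k * (rows + cols) * bytes_per_elem
    # Uplink: the finished output tile (rows x cols).
    up_bytes = rows * cols * bytes_per_elem
    down_s = down_bytes * 8 / (down_mbps * 1e6)
    up_s = up_bytes * 8 / (up_mbps * 1e6)
    return down_bytes, up_bytes, down_s, up_s


if __name__ == "__main__":
    down_b, up_b, down_s, up_s = tile_transfer_seconds(
        m=4096, k=4096, n=4096, p=16, q=16, down_mbps=100, up_mbps=20
    )
    print(f"downlink: {down_b:.0f} bytes in {down_s:.3f} s")
    print(f"uplink:   {up_b:.0f} bytes in {up_s:.3f} s")
```

With these illustrative numbers the device receives 4 MiB and returns 128 KiB, so the large stream travels over the fast direction of the link. Sending the same 4 MiB over the 20 Mbit/s uplink would take five times as long.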

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces Cleave, a parameter-server-centric system for edge-assisted foundation model training. It exploits the asymmetric I/O pattern of GEMM operations (large downlink inputs, small uplink partial outputs) to make per-device communication decrease with scale, decomposes training into independent sub-GEMM tasks, and uses a unified scheduler that jointly handles memory constraints, communication, and fault tolerance under device heterogeneity and churn. The evaluation claims cloud-comparable GPU performance, 4-10x per-batch runtime improvement over prior edge-training methods at the same device counts, scalability to thousands of heterogeneous devices, and at least 100x faster recovery from device failures.

Significance. If the performance and scaling claims hold under realistic conditions, the work could meaningfully advance democratized foundation-model training by utilizing idle edge resources. The alignment of GEMM asymmetry with typical edge network bandwidth ratios and the single scheduling abstraction for memory/communication/fault-tolerance are potentially valuable contributions to distributed ML systems.

major comments (3)
  1. [Abstract] Abstract: performance numbers (4-10x runtime, 100x recovery) and scaling claims to thousands of devices are stated without any reference to experimental setup, baselines, error bars, or how device heterogeneity and churn were modeled, leaving the central claims unsupported by visible evidence.
  2. [Evaluation] Evaluation section: no quantitative breakdown of communication volume versus device count or failure rate is provided, so the claim that per-device traffic decreases with scale (and the extrapolation beyond small-scale tests) cannot be assessed.
  3. [Architecture] Architecture and scheduling description: the assumption that sub-GEMM decomposition plus unified scheduling incurs no hidden synchronization or rescheduling overhead under churn and network variability is load-bearing for the 4-10x and 100x claims, yet no measurements or analysis of these overheads are shown.
minor comments (1)
  1. [Abstract] The phrase 'cloud-comparable GPU training performance' is imprecise; specify the exact metrics, model sizes, and cloud baseline used.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and committing to specific revisions that will strengthen the presentation of our results and claims.

Point-by-point responses
  1. Referee: [Abstract] Abstract: performance numbers (4-10x runtime, 100x recovery) and scaling claims to thousands of devices are stated without any reference to experimental setup, baselines, error bars, or how device heterogeneity and churn were modeled, leaving the central claims unsupported by visible evidence.

    Authors: We agree the abstract would be improved by explicit pointers to supporting details. In the revised version we will append a brief clause directing readers to Section 5, where the experimental setup (including trace-driven modeling of heterogeneity and churn), baselines, and error bars are fully described. The reported numbers derive from those experiments. revision: yes

  2. Referee: [Evaluation] Evaluation section: no quantitative breakdown of communication volume versus device count or failure rate is provided, so the claim that per-device traffic decreases with scale (and the extrapolation beyond small-scale tests) cannot be assessed.

    Authors: The evaluation section reports aggregate communication costs but lacks the requested per-device breakdown. We will add a new figure and accompanying text in Section 5 that plots uplink and downlink volume per device as functions of device count (10–2000) and failure rate (0–20 %), confirming the decrease predicted by the GEMM asymmetry and supporting the scaling extrapolation. revision: yes

  3. Referee: [Architecture] Architecture and scheduling description: the assumption that sub-GEMM decomposition plus unified scheduling incurs no hidden synchronization or rescheduling overhead under churn and network variability is load-bearing for the 4-10x and 100x claims, yet no measurements or analysis of these overheads are shown.

    Authors: We collected these overhead measurements during our experiments but did not isolate them in the text. The revised architecture section will include a dedicated analysis and microbenchmark results showing that synchronization and rescheduling overheads remain below 5 % of runtime even at 15 % churn and under realistic network variability, thereby substantiating the performance claims. revision: yes
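The per-device breakdown requested in major comment 2 can be approximated analytically before any measurement is taken. The sketch below is a rough model, assuming a square grid of output tiles (one per device) and independent task failures where a lost attempt has already consumed its full downlink transfer. The function names and these modeling choices are hypothetical and are not drawn from the manuscript.

```python
import math


def per_device_elements(m, k, n, devices):
    """Return (downlink, uplink) matrix elements per device for C = A @ B.

    Assumes A is (m, k), B is (k, n), and the output is split into a square
    grid with one tile per device, so `devices` must be a perfect square.
    """
    side = math.isqrt(devices)
    if side * side != devices:
        raise ValueError("this sketch assumes a square device grid")
    rows, cols = m / side, n / side
    return k * (rows + cols), rows * cols


def expected_downlink(down_elements, failure_rate):
    """Expected downlink elements per completed tile under retries.

    Assumes each attempt fails independently with probability failure_rate,
    so the expected number of attempts is 1 / (1 - failure_rate).
    """
    if not 0.0 <= failure_rate < 1.0:
        raise ValueError("failure_rate must be in [0, 1)")
    return down_elements / (1.0 - failure_rate)


if __name__ == "__main__":
    for d in (16, 64, 256, 1024):
        down, up = per_device_elements(4096, 4096, 4096, d)
        print(d, down, up, expected_downlink(down, 0.20))
```

In this model both volumes fall as the device count rises, and a 20% failure rate inflates expected downlink traffic by a factor of 1.25. Measured curves could differ if inputs are cached, shared across tiles, or only partially transferred before a failure.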

Circularity Check

0 steps flagged

No circularity: architecture insight and scaling claims rest on empirical evaluation without self-referential derivations or fitted predictions

Full rationale

The provided manuscript text contains no equations, parameter fits, or mathematical derivations. The core claim (asymmetric GEMM I/O exploited via parameter-server decomposition to reduce per-device communication with scale) is presented as a structural observation matched to edge network properties, followed by system implementation and evaluation results. No self-citations are used to justify uniqueness theorems or to smuggle ansatzes. The 4-10x and 100x recovery claims are tied to reported measurements rather than reducing by construction to inputs. This is a standard systems paper whose central results are externally falsifiable via replication on hardware; no load-bearing step collapses to a self-definition or renamed fit.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a systems paper; the abstract contains no explicit free parameters, mathematical axioms, or newly invented entities. All claims rest on the architectural insight and evaluation results.

pith-pipeline@v0.9.0 · 5581 in / 1135 out tokens · 30522 ms · 2026-05-16T22:26:38.214648+00:00 · methodology



Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 6 internal anchors

  1. [1]

    Kadir Akbudak, Oguz Selvitopi, and Cevdet Aykanat. 2018. Partitioning models for scaling parallel sparse matrix-matrix multiplication. ACM Trans. Parallel Comput., 4, 3, 13:1–13:34

  2. [2]

    Backlinko. 2023. Smartphone usage statistics. https://backlinko.com/smartphone-usage-statistics. Accessed: 2024-07-28. (2023)

  3. [3]

    Brian R. Bartoldson, Bhavya Kailkhura, and Davis W. Blalock. 2023. Compute-efficient deep learning: algorithmic trends and opportunities. J. Mach. Learn. Res., 24, 122:1–122:77

  4. [4]

    Giovanni Bartolomeo, Mehdi Yosofie, Simon Bäurle, Oliver Haluszczynski, Nitinder Mohan, and Jörg Ott. 2023. Oakestra: A lightweight hierarchical orchestration framework for edge computing. In USENIX ATC. USENIX Association, 215–231

  5. [5]

    Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, et al. 2021. On the opportunities and risks of foundation models. (2021). arXiv: 2108.07258

  6. [6]

    S. Boucheron, G. Lugosi, and P. Massart. 2013. Concentration inequalities: a nonasymptotic theory of independence. Oxford University Press. (2013)

  7. [7]

    BT. 2024. Broadband deals. https://www.bt.com/broadband/deals. (2024)

  8. [8]

    Jiasi Chen and Xukan Ran. 2019. Deep learning with edge computing: A review. Proc. IEEE, 107, 8, 1655–1674

  9. [9]

    Shenggan Cheng, Ziming Liu, Jiangsu Du, and Yang You. 2023. ATP: adaptive tensor parallelism for foundation models. (2023). arXiv: 2301.08658

  10. [10]

    Herbert A. David and Haikady N. Nagaraja. 2004. Order statistics. John Wiley & Sons

  11. [11]

    L. de Haan and A. Ferreira. 2006. Extreme value theory: an introduction. Springer Science+Business Media, LLC, New York

  12. [12]

    Michael Diskin et al. 2021. Distributed deep learning in open collaborations. In NeurIPS, 7879–7897

  13. [13]

    Jianhua Gao, Weixing Ji, Fangli Chang, Shiyu Han, Bingxin Wei, Zeming Liu, and Yizhuo Wang. 2023. A systematic survey of general sparse matrix-matrix multiplication. ACM Comput. Surv., 55, 12, 244:1–244:36

  14. [14]

    GitHub. 2021. GitHub Copilot · Your AI pair programmer. https://github.com/features/copilot. Accessed: 2024-05-17. (2021)

  15. [15]

    Google. 2024. gRPC – an RPC library and framework. https://github.com/grpc/grpc. Accessed: 2024-05-17. (2024)

  16. [16]

    Ronald L. Graham. 1969. Bounds on multiprocessing timing anomalies. SIAM Journal on Applied Mathematics, 17, 2, 416–429

  17. [17]

    Gurobi Optimization, LLC. 2024. Gurobi Optimizer Reference Manual. (2024). https://www.gurobi.com

  18. [18]

    Pengzhan Hao and Yifan Zhang. 2021. EDDL: A distributed deep learning system for resource-limited edge computing environment. In SEC. IEEE, 1–13

  19. [19]

    John L. Hennessy and David A. Patterson. 2012. Computer Architecture - A Quantitative Approach, 5th Edition. Morgan Kaufmann

  20. [20]

    Junxian Huang, Feng Qian, Alexandre Gerber, Zhuoqing Morley Mao, Subhabrata Sen, and Oliver Spatscheck. 2012. A close examination of performance and power characteristics of 4G LTE networks. In MobiSys. ACM, 225–238

  21. [21]

    Yanping Huang et al. 2019. GPipe: efficient training of giant neural networks using pipeline parallelism. In NeurIPS, 103–112

  22. [22]

    Andrey Ignatov, Radu Timofte, William Chou, Ke Wang, Max Wu, Tim Hartley, and Luc Van Gool. 2018. AI benchmark: running deep neural networks on android smartphones. In ECCV Workshops (5) (Lecture Notes in Computer Science). Vol. 11133. Springer, 288–314

  23. [24]

    Jared Kaplan et al. 2020. Scaling laws for neural language models. (2020). arXiv: 2001.08361

  24. [25]

    Rupesh Khendry. 2023. The era of generative AI: driving transformation in capital markets. https://www.microsoft.com/en-us/industry/blog/financial-services/2023/07/10/the-era-of-generative-ai-driving-transformation-in-capital-markets/. Accessed: 2024-05-17. (2023)

  25. [26]

    KubeEdge. 2024. Kubernetes native edge computing framework. https://kubeedge.io/. (2024)

  26. [27]

    Fan Lai, Yinwei Dai, Sanjay Sri Vallabh Singapuram, Jiachen Liu, Xiangfeng Zhu, Harsha V. Madhyastha, and Mosharaf Chowdhury. 2022. FedScale: benchmarking model and system performance of federated learning at scale. In ICML (Proceedings of Machine Learning Research). Vol. 162. PMLR, 11814–11827

  27. [28]

    Malcolm R. Leadbetter, Georg Lindgren, and Holger Rootzén. 2012. Extremes and related properties of random sequences and processes. Springer Science & Business Media

  28. [29]

    Kangwook Lee, Maximilian Lam, Ramtin Pedarsani, Dimitris S. Papailiopoulos, and Kannan Ramchandran. 2018. Speeding up distributed machine learning using codes. IEEE Trans. Inf. Theory, 64, 3, 1514–1529

  29. [30]

    Jan Karel Lenstra, David B. Shmoys, and Éva Tardos. 1990. Approximation algorithms for scheduling unrelated parallel machines. Mathematical Programming, 46, 1, 259–271

  30. [31]

    Dacheng Li, Hongyi Wang, Eric P. Xing, and Hao Zhang. 2022. AMP: automatically finding model parallel strategies with heterogeneity awareness. In NeurIPS

  31. [32]

    Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. 2014. Scaling distributed machine learning with the parameter server. In OSDI. USENIX Association, 583–598

  32. [33]

    Xiangyu Li, Yuanchun Li, Yuanzhe Li, Ting Cao, and Yunxin Liu. 2024. FlexNN: efficient and adaptive DNN inference on memory-constrained edge devices. In MobiCom. ACM, 709–723

  33. [34]

    Weijian Liu, Mingzhen Li, Guangming Tan, and Weile Jia. 2025. Mario: near zero-cost activation checkpointing in pipeline parallelism. In PPoPP. ACM, 197–211

  34. [35]

    Miguel Sousa Lobo, Lieven Vandenberghe, Stephen Boyd, and Hervé Lebret. 1998. Applications of second-order cone programming. Linear Algebra and its Applications, 284, 1-3, 193–228

  36. [37]

    M-Lab. 2021. The M-Lab MobiPerf dataset. https://measurementlab.net/tests/mobiperf. Accessed: 2024-10-17. (2021)

  37. [38]

    Xupeng Miao, Yining Shi, Zhi Yang, Bin Cui, and Zhihao Jia. 2023. SDPipe: A semi-decentralized framework for heterogeneity-aware pipeline-parallel training. Proc. VLDB Endow., 16, 9, 2354–2363

  38. [39]

    Rajeev Motwani and Prabhakar Raghavan. 1996. Randomized algorithms. ACM Comput. Surv., 28, 1, 33–37

  39. [40]

    Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, and Matei Zaharia. 2019. PipeDream: generalized pipeline parallelism for DNN training. In SOSP. ACM, 1–15

  40. [41]

    Ashkan Nikravesh, Yihua Guo, Feng Qian, Zhuoqing Morley Mao, and Subhabrata Sen. 2016. An in-depth understanding of multipath TCP on mobile devices: measurement and system design. In MobiCom. ACM, 189–201

  41. [42]

    OASIS. 2019. MQTT version 5.0. Retrieved June, 22, 2020, 1435

  42. [43]

    OASIS. 2012. OASIS advanced message queuing protocol (AMQP) version 1.0. International Journal of Aerospace Engineering Hindawi www.hindawi.com, 2018

  43. [44]

    OpenAI. 2023. GPT-4 technical report. (2023). arXiv: 2303.08774

  44. [45]

    Jay H. Park, Gyeongchan Yun, Chang M. Yi, Nguyen T. Nguyen, Seungmin Lee, Jaesik Choi, Sam H. Noh, and Young-ri Choi. 2020. HetPipe: enabling large DNN training on (whimpy) heterogeneous GPU clusters through integration of pipelined model parallelism and data parallelism. In USENIX ATC. USENIX Association, 307–321

  45. [46]

    David Patterson, Jeffrey M. Gilbert, Marco Gruteser, Efren Robles, Krishna Sekar, Yong Wei, and Tenghui Zhu. 2024. Energy and emissions of machine learning on smartphones vs. the cloud. Commun. ACM, 67, 2, 86–97

  46. [47]

    Shixiong Qi, K. K. Ramakrishnan, and Myungjin Lee. 2024. LIFL: A lightweight, event-driven serverless platform for federated learning. In MLSys. mlsys.org

  47. [48]

    Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. DeepSpeed: system optimizations enable training deep learning models with over 100 billion parameters. In KDD. ACM, 3505–3506

  48. [49]

    R. Tyrrell Rockafellar, Stanislav Uryasev, et al. 2000. Optimization of conditional value-at-risk. Journal of Risk, 2, 21–42

  49. [50]

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In CVPR. IEEE, 10674–10685

  50. [51]

    D. J. Russo, B. Van Roy, A. Kazerouni, I. Osband, Z. Wen, et al. 2018. A tutorial on Thompson sampling. Foundations and Trends® in Machine Learning, 11 (1): 1–96. (2018)

  51. [52]

    Max Ryabinin, Tim Dettmers, Michael Diskin, and Alexander Borzunov. 2023. SWARM parallelism: training large models can be surprisingly communication-efficient. In ICML (Proceedings of Machine Learning Research). Vol. 202. PMLR, 29416–29440

  52. [53]

    Max Ryabinin, Eduard Gorbunov, Vsevolod Plokhotnyuk, and Gennady Pekhimenko. 2021. Moshpit SGD: communication-efficient decentralized training on heterogeneous unreliable devices. In NeurIPS, 18195–18211

  53. [54]

    Max Ryabinin and Anton Gusev. 2020. Towards crowdsourced training of large neural networks using decentralized mixture-of-experts. In NeurIPS

  54. [55]

    Lorenzo Sani et al. 2025. Photon: federated LLM pre-training. In MLSys. mlsys.org

  55. [56]

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: training multi-billion parameter language models using model parallelism. (2019). arXiv: 1909.08053

  56. [57]

    Craig S. Smith. 2023. What large models cost you – there is no free ai lunch. https://www.forbes.com/sites/craigsmith/2023/09/08/what-large-models-cost-you--there-is-no-free-ai-lunch/?sh=2b6d10724af7. (Sept. 2023)

  57. [58]

    SPEEDTEST. 2025. Speed test global index. https://www.speedtest.net/global-index/united-states. Accessed: 2025-01-27. (2025)

  58. [59]

    Zhenheng Tang et al. 2023. FusionAI: decentralized training and deploying LLMs with massive consumer-level GPUs. (2023). arXiv: 2309.01172

  59. [60]

    Rajeev Thakur, Rolf Rabenseifner, and William Gropp. 2005. Optimization of collective communication operations in MPICH. Int. J. High Perform. Comput. Appl., 19, 1, 49–66

  60. [61]

    Chandra Thapa, Mahawaga Arachchige Pathum Chamikara, Seyit Camtepe, and Lichao Sun. 2022. SplitFed: when federated learning meets split learning. In AAAI. AAAI Press, 8485–8493

  61. [62]

    John Thorpe, Pengzhan Zhao, Jonathan Eyolfson, Yifan Qiao, Zhihao Jia, Minjia Zhang, Ravi Netravali, and Guoqing Harry Xu. 2023. Bamboo: making preemptible instances resilient for affordable training of large DNNs. In NSDI. USENIX Association, 497–513

  62. [63]

    Hugo Touvron, Louis Martin, Kevin Stone, et al. 2023. Llama 2: open foundation and fine-tuned chat models. (2023). arXiv: 2307.09288

  63. [64]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS, 5998–6008

  64. [65]

    Marcel Wagenländer, Guo Li, Bo Zhao, Luo Mai, and Peter R. Pietzuch. 2024. Tenplex: dynamic parallelism for deep learning using parallelizable tensor collections. In SOSP. ACM, 195–210

  65. [66]

    Duo Wu, Xianda Wang, Yaqi Qiao, Zhi Wang, Junchen Jiang, Shuguang Cui, and Fangxin Wang. 2024. NetLLM: adapting large language models for networking. In SIGCOMM. ACM, 661–678

  66. [67]

    Tairan Xu, Leyang Xue, Zhan Lu, Adrian Jackson, and Luo Mai. 2025. MoE-Gen: high-throughput MoE inference on a single gpu with module-based batching. (2025). arXiv: 2503.09716

  67. [68]

    Leyang Xue et al. 2025. Towards decentralized and sustainable foundation model training with the edge. ACM SIGENERGY Energy Informatics Review, 5, 2, 1–9

  68. [69]

    Shengyuan Ye, Liekang Zeng, Xiaowen Chu, Guoliang Xing, and Xu Chen. 2024. Asteroid: resource-efficient hybrid pipeline parallelism for collaborative DNN training on heterogeneous edge devices. In MobiCom. ACM, 312–326

  70. [71]

    Binhang Yuan, Yongjun He, Jared Davis, Tianyi Zhang, Tri Dao, Beidi Chen, Percy Liang, Christopher Ré, and Ce Zhang. 2022. Decentralized training of foundation models in heterogeneous environments. In NeurIPS

  71. [72]

    Haoran Zhang, Adney Cardoza, Peter Baile Chen, Sebastian Angel, and Vincent Liu. 2020. Fault-tolerant and transactional stateful serverless workflows. In OSDI. USENIX Association, 1187–1204

  72. [73]

    Susan Zhang, Stephen Roller, Naman Goyal, et al. 2022. OPT: open pre-trained transformer language models. (2022). arXiv: 2205.01068

  73. [74]

    Lianmin Zheng et al. 2022. Alpa: automating inter- and intra-operator parallelism for distributed deep learning. In OSDI. USENIX Association, 559–578

A Communication Efficiency: Homogeneous

We analyze the per-device communication volume and derive conditions under which Cleave achieves superior communication efficiency compared to conventional paralle...