On Harnessing Idle Compute at the Edge for Foundation Model Training

Amos Storkey; Leyang Xue; Mahesh K. Marina; Meghana Madhyastha; Myungjin Lee; Randal Burns

arxiv: 2512.22142 · v2 · submitted 2025-12-13 · 💻 cs.DC · cs.LG


Pith reviewed 2026-05-16 22:26 UTC · model grok-4.3


keywords: edge computing · foundation model training · GEMM operations · parameter server · device heterogeneity · distributed training · fault tolerance


The pith

Cleave trains foundation models on edge devices by exploiting GEMM's asymmetric I/O pattern to reach cloud-comparable speeds while scaling to thousands of heterogeneous nodes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Cleave as a system that harnesses spare compute on edge devices for foundation model training, addressing the centralization of current approaches. It builds on the insight that GEMM operations send large inputs over downlink and small outputs over uplink, which aligns with the 2-10x bandwidth asymmetry common in edge networks. A parameter-server architecture combined with decomposition into independent sub-GEMM tasks reduces per-device communication as scale increases and provides a single mechanism for memory limits, overhead control, and recovery under device churn. Evaluation shows this delivers cloud-like per-batch times and outperforms prior edge methods by 4-10x at equivalent device counts.
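To make the asymmetry concrete, here is a minimal sketch that is not taken from the paper. It assumes a device is handed one row block of A and one column block of B and returns a single output tile, with 2-byte elements. The function names, the tile sizes, and the 100/20 Mbit/s link are illustrative assumptions.

```python
def tile_io_bytes(rows: int, cols: int, k: int, elem_bytes: int = 2) -> tuple[int, int]:
    """Bytes sent to (downlink) and returned from (uplink) a device for one output tile.

    Assumed scheme: the device receives a (rows x k) block of A and a
    (k x cols) block of B, and returns the (rows x cols) tile of C.
    """
    downlink = (rows * k + k * cols) * elem_bytes
    uplink = rows * cols * elem_bytes
    return downlink, uplink


def transfer_seconds(n_bytes: int, mbit_per_s: float) -> float:
    """Time to move n_bytes over a link of the given bandwidth."""
    return n_bytes * 8 / (mbit_per_s * 1e6)


# Hypothetical tile: 256 x 256 outputs with an inner dimension of 4096.
down, up = tile_io_bytes(rows=256, cols=256, k=4096)
ratio = down / up

# Hypothetical edge link with a 5x downlink/uplink bandwidth asymmetry.
t_down = transfer_seconds(down, mbit_per_s=100.0)
t_up = transfer_seconds(up, mbit_per_s=20.0)
print(f"down={down} B, up={up} B, ratio={ratio:.0f}x, t_down={t_down:.3f}s, t_up={t_up:.3f}s")
```

Under these assumed sizes the input is 32 times larger than the output, so the heavier transfer lands on the faster downlink. Whether real tiles look like this depends on the partitioning the system actually chooses.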

Core claim

Cleave achieves cloud-comparable GPU training performance by aligning GEMM operations with edge network bandwidth asymmetries in a parameter-server architecture, allowing per-device communication to decrease with scale, and scales to thousands of heterogeneous devices with at least 100x faster failure recovery than prior systems.

What carries the argument

Parameter-server-centric architecture that decomposes training into independent sub-GEMM tasks to unify memory constraints, communication overhead, and fault tolerance under device churn.
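The idea of independent sub-GEMM tasks can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the output is split into square tiles, each tile is one task, and the helper names (`make_tasks`, `run_task`, `assemble`) are hypothetical. A lost tile is recomputed on its own, without touching the others.

```python
from itertools import product

Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain GEMM, used both as the per-device kernel and as the reference."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def make_tasks(m: int, n: int, tile: int) -> list[tuple[int, int]]:
    """One task per output tile, keyed by the tile's top-left corner."""
    return list(product(range(0, m, tile), range(0, n, tile)))


def run_task(a: Matrix, b: Matrix, task: tuple[int, int], tile: int) -> Matrix:
    """Compute one output tile from a row block of A and a column block of B."""
    r0, c0 = task
    a_block = a[r0:r0 + tile]
    b_block = [row[c0:c0 + tile] for row in b]
    return matmul(a_block, b_block)


def assemble(results: dict[tuple[int, int], Matrix], m: int, n: int) -> Matrix:
    """Server-side step: place each returned tile into the full output."""
    c = [[0.0] * n for _ in range(m)]
    for (r0, c0), block in results.items():
        for i, row in enumerate(block):
            c[r0 + i][c0:c0 + len(row)] = row
    return c


a = [[float(i + j) for j in range(6)] for i in range(4)]   # 4 x 6
b = [[float(i * j) for j in range(4)] for i in range(6)]   # 6 x 4
tasks = make_tasks(m=4, n=4, tile=2)

results = {t: run_task(a, b, t, tile=2) for t in tasks}
lost = tasks[1]                                # pretend the device holding this tile left
del results[lost]
results[lost] = run_task(a, b, lost, tile=2)   # only this tile is recomputed

c = assemble(results, m=4, n=4)
```

Because no tile depends on another, the same task list can serve as the unit for fitting memory (pick a smaller `tile`), for bounding transfer size, and for recovery.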

Load-bearing premise

The asymmetric I/O pattern of GEMM operations can be exploited at scale on real edge networks without hidden overheads from memory fragmentation, synchronization, or network variability that would erase the claimed speedups.

What would settle it

Deploy Cleave on a large real-world testbed of heterogeneous edge devices with measured variable network conditions and check whether the 4-10x runtime gains and 100x faster recovery times hold compared to baselines.

Figures

Figures reproduced from arXiv: 2512.22142 by Amos Storkey, Leyang Xue, Mahesh K. Marina, Meghana Madhyastha, Myungjin Lee, Randal Burns.

**Figure 2.** The workflow of Cleave, from the model defined in the training script to a DAG of GEMMs. Edges in the DAG represent memory dependencies. Each GEMM is scheduled selectively across devices with best-effort overlap of communication and computation. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png]

**Figure 3.** Normalized training latency for a batch (lower the [PITH_FULL_IMAGE:figures/full_fig_p006_3.png]

**Figure 4.** Normalized training latency for a batch with OPT [PITH_FULL_IMAGE:figures/full_fig_p007_4.png]

**Figure 6.** Latency performance under increasing stragglers, [PITH_FULL_IMAGE:figures/full_fig_p007_6.png]

**Figure 8.** Batch runtime of OPT-13B when scaling up the [PITH_FULL_IMAGE:figures/full_fig_p008_8.png]

**Figure 9.** Batch runtime when scaling up model size propor [PITH_FULL_IMAGE:figures/full_fig_p009_9.png]

**Figure 10.** Batch runtime of OPT-13B when scaling up batch [PITH_FULL_IMAGE:figures/full_fig_p009_10.png]

Original abstract

The foundation-model ecosystem remains highly centralized because training requires immense compute resources and is therefore largely limited to large cloud operators. Edge-assisted foundation model training that harnesses spare compute on edge devices offers a more democratized alternative. However, existing edge-training approaches fall short: they struggle to match cloud-training performance, scale to larger models, fit within device memory limits, or keep communication overhead manageable. They also do not handle device heterogeneity and churn satisfactorily. We introduce Cleave, built on a structural insight: each GEMM has an asymmetric I/O pattern, in which its input matrices, sent over downlink, are much larger than the partial output blocks returned over uplink. This matches edge networks, where downlink bandwidth exceeds uplink by 2-10x. Exploiting this alignment with a parameter-server-centric architecture, Cleave makes per-device communication decrease as more devices join, rather than stay constant as in conventional tensor parallelism (TP). Decomposing training into independent sub-GEMM tasks yields one scheduling abstraction that unifies memory constraints, communication overhead, and fault tolerance under device churn. Our evaluation shows that Cleave achieves cloud-comparable GPU training performance and outperforms state-of-the-art edge-training methods by 4-10x in per-batch runtime at the same device counts. Beyond this shared operating range, Cleave scales to thousands of heterogeneous devices, a regime where prior edge-training systems cannot operate, and achieves at least 100x faster recovery from device failures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Cleave's parameter-server design exploits GEMM I/O asymmetry to reduce per-device communication with scale on edge networks, but the large-scale claims need concrete traffic and overhead data to hold up.


The main contribution is a scheduling approach that decomposes training into sub-GEMM tasks and uses a parameter server to take advantage of downlink-heavy edge links. This lets communication per device drop as the system grows, the opposite of the usual tensor-parallel setups. They also fold memory limits, communication costs, and fault handling into one abstraction, which simplifies dealing with churn and heterogeneity.

The paper is clear on the limitations of prior edge-training work and shows how the design targets those gaps. The performance numbers reported against cloud baselines and other edge systems are the kind of concrete comparison that helps. If the full experiments confirm the scaling to thousands of devices and the fast recovery, it would be a useful data point for anyone trying to use distributed edge resources.

Where it is thinner is on the details of the scaling behavior. The stress test correctly flags that, without numbers on how communication volume changes with device count or how rescheduling costs behave under churn, the extrapolation from small clusters to thousands of devices is hard to assess. Minor issues like missing error bars or exact baselines would be easy to fix, but the core scaling argument needs that evidence.

This is for distributed-systems people working on ML training infrastructure. Readers who care about making training more accessible beyond big clouds will get value from the ideas even if they want to see more measurements. It deserves peer review because the approach is distinct and the problem matters, though revisions will likely focus on strengthening the large-scale evaluation.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces Cleave, a parameter-server-centric system for edge-assisted foundation model training. It exploits the asymmetric I/O pattern of GEMM operations (large downlink inputs, small uplink partial outputs) to make per-device communication decrease with scale, decomposes training into independent sub-GEMM tasks, and uses a unified scheduler that jointly handles memory constraints, communication, and fault tolerance under device heterogeneity and churn. The evaluation claims cloud-comparable GPU performance, 4-10x per-batch runtime improvement over prior edge-training methods at the same device counts, scalability to thousands of heterogeneous devices, and at least 100x faster recovery from device failures.

Significance. If the performance and scaling claims hold under realistic conditions, the work could meaningfully advance democratized foundation-model training by utilizing idle edge resources. The alignment of GEMM asymmetry with typical edge network bandwidth ratios and the single scheduling abstraction for memory/communication/fault-tolerance are potentially valuable contributions to distributed ML systems.

major comments (3)

[Abstract] Performance numbers (4-10x runtime, 100x recovery) and scaling claims to thousands of devices are stated without any reference to experimental setup, baselines, error bars, or how device heterogeneity and churn were modeled, leaving the central claims unsupported by visible evidence.

[Evaluation] No quantitative breakdown of communication volume versus device count or failure rate is provided, so the claim that per-device traffic decreases with scale (and the extrapolation beyond small-scale tests) cannot be assessed.

[Architecture] The assumption that sub-GEMM decomposition plus unified scheduling incurs no hidden synchronization or rescheduling overhead under churn and network variability is load-bearing for the 4-10x and 100x claims, yet no measurements or analysis of these overheads are shown.
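The breakdown requested in the Evaluation comment could start from a back-of-the-envelope model like the one below. It is a sketch under explicit assumptions and does not reproduce any measurement from the paper: the GEMM output is tiled over a square grid of devices, one tile per device, and conventional tensor parallelism is stood in for by a ring all-reduce over the full output, in which each device sends 2(N-1)/N times the tensor size. All function names and matrix sizes are hypothetical.

```python
import math


def tiled_volume(m: int, n: int, k: int, devices: int) -> tuple[float, float]:
    """Per-device (downlink, uplink) element counts for a g x g output tiling.

    Assumes `devices` is a perfect square and each device computes one tile.
    """
    g = math.isqrt(devices)
    if g * g != devices:
        raise ValueError("this sketch only handles square device grids")
    rows, cols = m / g, n / g
    return k * (rows + cols), rows * cols


def ring_allreduce_volume(elements: int, devices: int) -> float:
    """Elements each device sends in a ring all-reduce of `elements` values."""
    return 2 * (devices - 1) / devices * elements


M = N = K = 4096   # hypothetical square GEMM
table = []
for devices in (4, 16, 64, 256, 1024):
    down, up = tiled_volume(M, N, K, devices)
    ring = ring_allreduce_volume(M * N, devices)
    table.append((devices, down, up, ring))
    print(f"{devices:>5} devices  down={down:>12.0f}  up={up:>12.0f}  ring={ring:>12.0f}")
```

In this model the per-device downlink shrinks with the square root of the device count and the uplink shrinks linearly, while the ring all-reduce volume stays close to twice the tensor size. A measured version of this table is the kind of evidence the comment asks for.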

minor comments (1)

[Abstract] The phrase 'cloud-comparable GPU training performance' is imprecise; specify the exact metrics, model sizes, and cloud baseline used.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and committing to specific revisions that will strengthen the presentation of our results and claims.


Referee: [Abstract] Performance numbers (4-10x runtime, 100x recovery) and scaling claims to thousands of devices are stated without any reference to experimental setup, baselines, error bars, or how device heterogeneity and churn were modeled, leaving the central claims unsupported by visible evidence.

Authors: We agree the abstract would be improved by explicit pointers to supporting details. In the revised version we will append a brief clause directing readers to Section 5, where the experimental setup (including trace-driven modeling of heterogeneity and churn), baselines, and error bars are fully described. The reported numbers derive from those experiments.

Revision: yes

Referee: [Evaluation] No quantitative breakdown of communication volume versus device count or failure rate is provided, so the claim that per-device traffic decreases with scale (and the extrapolation beyond small-scale tests) cannot be assessed.

Authors: The evaluation section reports aggregate communication costs but lacks the requested per-device breakdown. We will add a new figure and accompanying text in Section 5 that plots uplink and downlink volume per device as functions of device count (10–2000) and failure rate (0–20%), confirming the decrease predicted by the GEMM asymmetry and supporting the scaling extrapolation.

Revision: yes

Referee: [Architecture] The assumption that sub-GEMM decomposition plus unified scheduling incurs no hidden synchronization or rescheduling overhead under churn and network variability is load-bearing for the 4-10x and 100x claims, yet no measurements or analysis of these overheads are shown.

Authors: We collected these overhead measurements during our experiments but did not isolate them in the text. The revised architecture section will include a dedicated analysis and microbenchmark results showing that synchronization and rescheduling overheads remain below 5% of runtime even at 15% churn and under realistic network variability, thereby substantiating the performance claims.

Revision: yes
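One way to reason about rescheduling cost under churn is a toy simulation such as the following. It is a hedged sketch, not the authors' microbenchmark: tasks are treated as equal-cost sub-GEMMs, a failure loses only the task in flight, a failed device is assumed to be replaced before the next round, and every parameter value is hypothetical.

```python
import random
from collections import deque


def run_with_churn(num_tasks: int, num_devices: int, fail_prob: float, seed: int = 0) -> dict:
    """Round-based simulation: a task lost to a device failure is simply re-queued.

    The device pool is assumed to stay at `num_devices` throughout, and
    `fail_prob` is the per-task chance that its device drops out mid-task.
    """
    rng = random.Random(seed)
    pending = deque(range(num_tasks))
    done: set[int] = set()
    attempts = rounds = 0
    while pending:
        rounds += 1
        batch = [pending.popleft() for _ in range(min(num_devices, len(pending)))]
        for task in batch:
            attempts += 1
            if rng.random() < fail_prob:
                pending.append(task)   # recovery = rerun this one sub-GEMM
            else:
                done.add(task)
    return {
        "rounds": rounds,
        "attempts": attempts,
        "wasted": attempts - num_tasks,
        "done": len(done),
    }


stats = run_with_churn(num_tasks=1000, num_devices=50, fail_prob=0.15)
print(stats)
```

The `wasted` count isolates the re-executed work, which is the quantity the referee asks to see reported. The simulation says nothing about synchronization or network variability, so it cannot confirm or refute the 5% figure quoted in the response.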

Circularity Check

0 steps flagged

No circularity: architecture insight and scaling claims rest on empirical evaluation without self-referential derivations or fitted predictions


The provided manuscript text contains no equations, parameter fits, or mathematical derivations. The core claim (asymmetric GEMM I/O exploited via parameter-server decomposition to reduce per-device communication with scale) is presented as a structural observation matched to edge network properties, followed by system implementation and evaluation results. No self-citations are used to justify uniqueness theorems or to smuggle ansatzes. The 4-10x and 100x recovery claims are tied to reported measurements rather than reducing by construction to inputs. This is a standard systems paper whose central results are externally falsifiable via replication on hardware; no load-bearing step collapses to a self-definition or renamed fit.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a systems paper; the abstract contains no explicit free parameters, mathematical axioms, or newly invented entities. All claims rest on the architectural insight and evaluation results.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: "Decomposing training into independent sub-GEMM tasks yields one scheduling abstraction that unifies memory constraints, communication overhead, and fault tolerance under device churn."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "Cleave achieves cloud-comparable GPU training performance... scales to thousands of heterogeneous devices"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 6 internal anchors

[1] Kadir Akbudak, Oguz Selvitopi, and Cevdet Aykanat. 2018. Partitioning models for scaling parallel sparse matrix-matrix multiplication. ACM Trans. Parallel Comput., 4, 3, 13:1–13:34.

[2] Backlinko. 2023. Smartphone usage statistics. https://backlinko.com/smartphone-usage-statistics. Accessed: 2024-07-28. (2023)

[3] Brian R. Bartoldson, Bhavya Kailkhura, and Davis W. Blalock. 2023. Compute-efficient deep learning: algorithmic trends and opportunities. J. Mach. Learn. Res., 24, 122:1–122:77.

[4] Giovanni Bartolomeo, Mehdi Yosofie, Simon Bäurle, Oliver Haluszczynski, Nitinder Mohan, and Jörg Ott. 2023. Oakestra: A lightweight hierarchical orchestration framework for edge computing. In USENIX ATC. USENIX Association, 215–231.

[5] Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, et al. 2021. On the opportunities and risks of foundation models. (2021). arXiv: 2108.07258

[6] S Boucheron, G Lugosi, and P Massart. 2013. Concentration inequalities: a nonasymptotic theory of independence. Oxford, UK: Oxford Univ. (2013)

[7] BT. 2024. Broadband deals. https://www.bt.com/broadband/deals. (2024)

[8] Jiasi Chen and Xukan Ran. 2019. Deep learning with edge computing: A review. Proc. IEEE, 107, 8, 1655–1674.

[9] Shenggan Cheng, Ziming Liu, Jiangsu Du, and Yang You. 2023. ATP: adaptive tensor parallelism for foundation models. (2023). arXiv: 2301.08658

[10] Herbert A David and Haikady N Nagaraja. 2004. Order statistics. John Wiley & Sons.

[11] L de Haan and A Ferreira. 2006. Extreme value theory: an introduction. Springer Science+Business Media, LLC, New York.

[12] Michael Diskin et al. 2021. Distributed deep learning in open collaborations. In NeurIPS, 7879–7897.

[13] Jianhua Gao, Weixing Ji, Fangli Chang, Shiyu Han, Bingxin Wei, Zeming Liu, and Yizhuo Wang. 2023. A systematic survey of general sparse matrix-matrix multiplication. ACM Comput. Surv., 55, 12, 244:1–244:36.

[14] GitHub. 2021. GitHub Copilot · Your AI pair programmer. https://github.com/features/copilot. Accessed: 2024-05-17. (2021)

[15] Google. 2024. gRPC – an RPC library and framework. https://github.com/grpc/grpc. Accessed: 2024-05-17. (2024)

[16] Ronald L. Graham. 1969. Bounds on multiprocessing timing anomalies. SIAM Journal on Applied Mathematics, 17, 2, 416–429.

[17] Gurobi Optimization, LLC. 2024. Gurobi Optimizer Reference Manual. (2024). https://www.gurobi.com

[18] Pengzhan Hao and Yifan Zhang. 2021. EDDL: A distributed deep learning system for resource-limited edge computing environment. In SEC. IEEE, 1–13.

[19] John L. Hennessy and David A. Patterson. 2012. Computer Architecture - A Quantitative Approach, 5th Edition. Morgan Kaufmann.

[20] Junxian Huang, Feng Qian, Alexandre Gerber, Zhuoqing Morley Mao, Subhabrata Sen, and Oliver Spatscheck. 2012. A close examination of performance and power characteristics of 4G LTE networks. In MobiSys. ACM, 225–238.

[21] Yanping Huang et al. 2019. GPipe: efficient training of giant neural networks using pipeline parallelism. In NeurIPS, 103–112.

[22] Andrey Ignatov, Radu Timofte, William Chou, Ke Wang, Max Wu, Tim Hartley, and Luc Van Gool. 2018. AI benchmark: running deep neural networks on Android smartphones. In ECCV Workshops (5) (Lecture Notes in Computer Science). Vol. 11133. Springer, 288–314.

[24] Jared Kaplan et al. 2020. Scaling laws for neural language models. (2020). arXiv: 2001.08361

[25] Rupesh Khendry. 2023. The era of generative AI: driving transformation in capital markets. https://www.microsoft.com/en-us/industry/blog/financial-services/2023/07/10/the-era-of-generative-ai-driving-transformation-in-capital-markets/. Accessed: 2024-05-17. (2023)

[26] KubeEdge. 2024. Kubernetes native edge computing framework. https://kubeedge.io/. (2024)

[27] Fan Lai, Yinwei Dai, Sanjay Sri Vallabh Singapuram, Jiachen Liu, Xiangfeng Zhu, Harsha V. Madhyastha, and Mosharaf Chowdhury. 2022. FedScale: benchmarking model and system performance of federated learning at scale. In ICML (Proceedings of Machine Learning Research). Vol. 162. PMLR, 11814–11827.

[28] Malcolm R Leadbetter, Georg Lindgren, and Holger Rootzén. 2012. Extremes and related properties of random sequences and processes. Springer Science & Business Media.

[29] Kangwook Lee, Maximilian Lam, Ramtin Pedarsani, Dimitris S. Papailiopoulos, and Kannan Ramchandran. 2018. Speeding up distributed machine learning using codes. IEEE Trans. Inf. Theory, 64, 3, 1514–1529.

[30] Jan Karel Lenstra, David B Shmoys, and Éva Tardos. 1990. Approximation algorithms for scheduling unrelated parallel machines. Mathematical Programming, 46, 1, 259–271.

[31] Dacheng Li, Hongyi Wang, Eric P. Xing, and Hao Zhang. 2022. AMP: automatically finding model parallel strategies with heterogeneity awareness. In NeurIPS.

[32] Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. 2014. Scaling distributed machine learning with the parameter server. In OSDI. USENIX Association, 583–598.

[33] Xiangyu Li, Yuanchun Li, Yuanzhe Li, Ting Cao, and Yunxin Liu. 2024. FlexNN: efficient and adaptive DNN inference on memory-constrained edge devices. In MobiCom. ACM, 709–723.

[34] Weijian Liu, Mingzhen Li, Guangming Tan, and Weile Jia. 2025. Mario: near zero-cost activation checkpointing in pipeline parallelism. In PPoPP. ACM, 197–211.

[35] Miguel Sousa Lobo, Lieven Vandenberghe, Stephen Boyd, and Hervé Lebret. Applications of second-order cone programming. Linear Algebra and its Applications, 284, 1-3, 193–228.

[37] M-Lab. 2021. The M-Lab MobiPerf dataset. https://measurementlab.net/tests/mobiperf. Accessed: 2024-10-17. (2021)

[38] Xupeng Miao, Yining Shi, Zhi Yang, Bin Cui, and Zhihao Jia. 2023. SDPipe: A semi-decentralized framework for heterogeneity-aware pipeline-parallel training. Proc. VLDB Endow., 16, 9, 2354–2363.

[39] Rajeev Motwani and Prabhakar Raghavan. 1996. Randomized algorithms. ACM Comput. Surv., 28, 1, 33–37.

[40] Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, and Matei Zaharia. 2019. PipeDream: generalized pipeline parallelism for DNN training. In SOSP. ACM, 1–15.

[41] Ashkan Nikravesh, Yihua Guo, Feng Qian, Zhuoqing Morley Mao, and Subhabrata Sen. 2016. An in-depth understanding of multipath TCP on mobile devices: measurement and system design. In MobiCom. ACM, 189–201.

[42] OASIS. 2019. MQTT version 5.0. Retrieved June, 22, 2020, 1435.

[43] OASIS. 2012. OASIS advanced message queuing protocol (AMQP) version 1.0. International Journal of Aerospace Engineering Hindawi www.hindawi.com, 2018.

[44] OpenAI. 2023. GPT-4 technical report. (2023). arXiv: 2303.08774

[45] Jay H. Park, Gyeongchan Yun, Chang M. Yi, Nguyen T. Nguyen, Seungmin Lee, Jaesik Choi, Sam H. Noh, and Young-ri Choi. 2020. HetPipe: enabling large DNN training on (whimpy) heterogeneous GPU clusters through integration of pipelined model parallelism and data parallelism. In USENIX ATC. USENIX Association, 307–321.

[46] David Patterson, Jeffrey M. Gilbert, Marco Gruteser, Efren Robles, Krishna Sekar, Yong Wei, and Tenghui Zhu. 2024. Energy and emissions of machine learning on smartphones vs. the cloud. Commun. ACM, 67, 2, 86–97.

[47] Shixiong Qi, K. K. Ramakrishnan, and Myungjin Lee. 2024. LIFL: A lightweight, event-driven serverless platform for federated learning. In MLSys. mlsys.org.

[48] Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. DeepSpeed: system optimizations enable training deep learning models with over 100 billion parameters. In KDD. ACM, 3505–3506.

[49] R Tyrrell Rockafellar, Stanislav Uryasev, et al. 2000. Optimization of conditional value-at-risk. Journal of Risk, 2, 21–42.

[50] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In CVPR. IEEE, 10674–10685.

[51] DJ Russo, B Van Roy, A Kazerouni, I Osband, Z Wen, et al. 2018. A tutorial on Thompson sampling. Foundations and Trends® in Machine Learning, 11 (1): 1–96. (2018)

[52] Max Ryabinin, Tim Dettmers, Michael Diskin, and Alexander Borzunov. 2023. SWARM parallelism: training large models can be surprisingly communication-efficient. In ICML (Proceedings of Machine Learning Research). Vol. 202. PMLR, 29416–29440.

[53] Max Ryabinin, Eduard Gorbunov, Vsevolod Plokhotnyuk, and Gennady Pekhimenko. 2021. Moshpit SGD: communication-efficient decentralized training on heterogeneous unreliable devices. In NeurIPS, 18195–18211.

[54] Max Ryabinin and Anton Gusev. 2020. Towards crowdsourced training of large neural networks using decentralized mixture-of-experts. In NeurIPS.

[55] Lorenzo Sani et al. 2025. Photon: federated LLM pre-training. In MLSys. mlsys.org.

[56] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: training multi-billion parameter language models using model parallelism. (2019). arXiv: 1909.08053

[57] Craig S. Smith. 2023. What large models cost you – there is no free AI lunch. https://www.forbes.com/sites/craigsmith/2023/09/08/what-large-models-cost-you--there-is-no-free-ai-lunch/?sh=2b6d10724af7. (Sept. 2023)

[58] SPEEDTEST. 2025. Speed test global index. https://www.speedtest.net/global-index/united-states. Accessed: 2025-01-27. (2025)

[59] Zhenheng Tang et al. 2023. FusionAI: decentralized training and deploying LLMs with massive consumer-level GPUs. (2023). arXiv: 2309.01172

[60] Rajeev Thakur, Rolf Rabenseifner, and William Gropp. 2005. Optimization of collective communication operations in MPICH. Int. J. High Perform. Comput. Appl., 19, 1, 49–66.

[61] Chandra Thapa, Mahawaga Arachchige Pathum Chamikara, Seyit Camtepe, and Lichao Sun. 2022. SplitFed: when federated learning meets split learning. In AAAI. AAAI Press, 8485–8493.

[62] John Thorpe, Pengzhan Zhao, Jonathan Eyolfson, Yifan Qiao, Zhihao Jia, Minjia Zhang, Ravi Netravali, and Guoqing Harry Xu. 2023. Bamboo: making preemptible instances resilient for affordable training of large DNNs. In NSDI. USENIX Association, 497–513.

[63] Hugo Touvron, Louis Martin, Kevin Stone, et al. 2023. Llama 2: open foundation and fine-tuned chat models. (2023). arXiv: 2307.09288

[64] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS, 5998–6008.

[65] Marcel Wagenländer, Guo Li, Bo Zhao, Luo Mai, and Peter R. Pietzuch. 2024. Tenplex: dynamic parallelism for deep learning using parallelizable tensor collections. In SOSP. ACM, 195–210.

[66] Duo Wu, Xianda Wang, Yaqi Qiao, Zhi Wang, Junchen Jiang, Shuguang Cui, and Fangxin Wang. 2024. NetLLM: adapting large language models for networking. In SIGCOMM. ACM, 661–678.

[67] Tairan Xu, Leyang Xue, Zhan Lu, Adrian Jackson, and Luo Mai. 2025. MoE-Gen: high-throughput MoE inference on a single GPU with module-based batching. (2025). arXiv: 2503.09716

[68] Leyang Xue et al. 2025. Towards decentralized and sustainable foundation model training with the edge. ACM SIGENERGY Energy Informatics Review, 5, 2, 1–9.

[69] Shengyuan Ye, Liekang Zeng, Xiaowen Chu, Guoliang Xing, and Xu Chen. Asteroid: resource-efficient hybrid pipeline parallelism for collaborative DNN training on heterogeneous edge devices. In MobiCom. ACM, 312–326.

[71] Binhang Yuan, Yongjun He, Jared Davis, Tianyi Zhang, Tri Dao, Beidi Chen, Percy Liang, Christopher Ré, and Ce Zhang. 2022. Decentralized training of foundation models in heterogeneous environments. In NeurIPS.

[72] Haoran Zhang, Adney Cardoza, Peter Baile Chen, Sebastian Angel, and Vincent Liu. 2020. Fault-tolerant and transactional stateful serverless workflows. In OSDI. USENIX Association, 1187–1204.

[73] Susan Zhang, Stephen Roller, Naman Goyal, et al. 2022. OPT: open pre-trained transformer language models. (2022). arXiv: 2205.01068

[74] Lianmin Zheng et al. 2022. Alpa: automating inter- and intra-operator parallelism for distributed deep learning. In OSDI. USENIX Association, 559–578.

[1] [1]

Kadir Akbudak, Oguz Selvitopi, and Cevdet Aykanat. 2018. Partitioning models for scaling parallel sparse matrix-matrix multiplication.ACM Trans. Parallel Comput., 4, 3, 13:1–13:34. Leyang Xue†, Meghana Madhyastha ‡, Myungjin Lee ⋄ , Amos Storkey †, Randal Burns ‡ and Mahesh K. Marina †

work page 2018

[2] [2]

Backlinko. 2023. Smartphone usage statistics. https://backlinko.com/smartpho ne-usage-statistics. Accessed: 2024-07-28. (2023)

work page 2023

[3] [3]

Bartoldson, Bhavya Kailkhura, and Davis W

Brian R. Bartoldson, Bhavya Kailkhura, and Davis W. Blalock. 2023. Compute- efficient deep learning: algorithmic trends and opportunities.J. Mach. Learn. Res., 24, 122:1–122:77

work page 2023

[4] [4]

Giovanni Bartolomeo, Mehdi Yosofie, Simon Bäurle, Oliver Haluszczynski, Nitinder Mohan, and Jörg Ott. 2023. Oakestra: A lightweight hierarchical orchestration framework for edge computing. InUSENIX ATC. USENIX Asso- ciation, 215–231

work page 2023

[5] [5]

On the Opportunities and Risks of Foundation Models

Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, and et al. 2021. On the opportunities and risks of foundation models. (2021). arXiv: 2108.07258

work page internal anchor Pith review Pith/arXiv arXiv 2021

[6] [6]

S Boucheron, G Lugosi, and P Massart. 2013. Concentration inequalities: a nonasymptotic theory of independence oxford, uk: oxford univ. (2013)

work page 2013

[7] [7]

BT. 2024. Broadband deals. https://www.bt.com/broadband/deals. (2024)

work page 2024

[8] [8]

Jiasi Chen and Xukan Ran. 2019. Deep learning with edge computing: A review. Proc. IEEE, 107, 8, 1655–1674

work page 2019

[9] [9]

Shenggan Cheng, Ziming Liu, Jiangsu Du, and Yang You. 2023. ATP: adaptive tensor parallelism for foundation models. (2023). arXiv: 2301.08658

work page arXiv 2023

[10] [10]

2004.Order statistics

Herbert A David and Haikady N Nagaraja. 2004.Order statistics. John Wiley & Sons

work page 2004

[11] [11]

L de Haan and A Ferreira. 2006. Extreme value theory: an introduction springer science+ business media.LLC, New York

[12]

Michael Diskin et al. 2021. Distributed deep learning in open collaborations. In NeurIPS, 7879–7897

[13]

Jianhua Gao, Weixing Ji, Fangli Chang, Shiyu Han, Bingxin Wei, Zeming Liu, and Yizhuo Wang. 2023. A systematic survey of general sparse matrix-matrix multiplication. ACM Comput. Surv., 55, 12, 244:1–244:36

[14]

GitHub. 2021. GitHub Copilot · Your AI pair programmer. https://github.com/features/copilot. Accessed: 2024-05-17. (2021)

[15]

Google. 2024. gRPC – an RPC library and framework. https://github.com/grpc/grpc. Accessed: 2024-05-17. (2024)

[16]

Ronald L. Graham. 1969. Bounds on multiprocessing timing anomalies. SIAM Journal on Applied Mathematics, 17, 2, 416–429

[17]

Gurobi Optimization, LLC. 2024. Gurobi Optimizer Reference Manual. (2024). https://www.gurobi.com

[18]

Pengzhan Hao and Yifan Zhang. 2021. EDDL: A distributed deep learning system for resource-limited edge computing environment. In SEC. IEEE, 1–13

[19]

John L. Hennessy and David A. Patterson. 2012. Computer Architecture - A Quantitative Approach, 5th Edition. Morgan Kaufmann

[20]

Junxian Huang, Feng Qian, Alexandre Gerber, Zhuoqing Morley Mao, Subhabrata Sen, and Oliver Spatscheck. 2012. A close examination of performance and power characteristics of 4G LTE networks. In MobiSys. ACM, 225–238

[21]

Yanping Huang et al. 2019. GPipe: efficient training of giant neural networks using pipeline parallelism. In NeurIPS, 103–112

[22]

Andrey Ignatov, Radu Timofte, William Chou, Ke Wang, Max Wu, Tim Hartley, and Luc Van Gool. 2018. AI benchmark: running deep neural networks on Android smartphones. In ECCV Workshops (5) (Lecture Notes in Computer Science). Vol. 11133. Springer, 288–314

[23] [24]

Jared Kaplan et al. 2020. Scaling laws for neural language models. (2020). arXiv: 2001.08361

[24] [25]

Rupesh Khendry. 2023. The era of generative AI: driving transformation in capital markets. https://www.microsoft.com/en-us/industry/blog/financial-services/2023/07/10/the-era-of-generative-ai-driving-transformation-in-capital-markets/. Accessed: 2024-05-17. (2023)

[25] [26]

KubeEdge. 2024. Kubernetes native edge computing framework. https://kubeedge.io/. (2024)

[26] [27]

Fan Lai, Yinwei Dai, Sanjay Sri Vallabh Singapuram, Jiachen Liu, Xiangfeng Zhu, Harsha V. Madhyastha, and Mosharaf Chowdhury. 2022. FedScale: benchmarking model and system performance of federated learning at scale. In ICML (Proceedings of Machine Learning Research). Vol. 162. PMLR, 11814–11827

[27] [28]

Malcolm R. Leadbetter, Georg Lindgren, and Holger Rootzén. 2012. Extremes and related properties of random sequences and processes. Springer Science & Business Media

[28] [29]

Kangwook Lee, Maximilian Lam, Ramtin Pedarsani, Dimitris S. Papailiopoulos, and Kannan Ramchandran. 2018. Speeding up distributed machine learning using codes. IEEE Trans. Inf. Theory, 64, 3, 1514–1529

[29] [30]

Jan Karel Lenstra, David B. Shmoys, and Éva Tardos. 1990. Approximation algorithms for scheduling unrelated parallel machines. Mathematical Programming, 46, 1, 259–271

[30] [31]

Dacheng Li, Hongyi Wang, Eric P. Xing, and Hao Zhang. 2022. AMP: automatically finding model parallel strategies with heterogeneity awareness. In NeurIPS

[31] [32]

Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. 2014. Scaling distributed machine learning with the parameter server. In OSDI. USENIX Association, 583–598

[32] [33]

Xiangyu Li, Yuanchun Li, Yuanzhe Li, Ting Cao, and Yunxin Liu. 2024. FlexNN: efficient and adaptive DNN inference on memory-constrained edge devices. In MobiCom. ACM, 709–723

[33] [34]

Weijian Liu, Mingzhen Li, Guangming Tan, and Weile Jia. 2025. Mario: near zero-cost activation checkpointing in pipeline parallelism. In PPoPP. ACM, 197–211

[34] [35]

Miguel Sousa Lobo, Lieven Vandenberghe, Stephen Boyd, and Hervé Lebret. 1998. Applications of second-order cone programming. Linear Algebra and its Applications, 284, 1-3, 193–228

[36] [37]

M-Lab. 2021. The M-Lab MobiPerf dataset. https://measurementlab.net/tests/mobiperf. Accessed: 2024-10-17. (2021)

[37] [38]

Xupeng Miao, Yining Shi, Zhi Yang, Bin Cui, and Zhihao Jia. 2023. SDPipe: A semi-decentralized framework for heterogeneity-aware pipeline-parallel training. Proc. VLDB Endow., 16, 9, 2354–2363

[38] [39]

Rajeev Motwani and Prabhakar Raghavan. 1996. Randomized algorithms. ACM Comput. Surv., 28, 1, 33–37

[39] [40]

Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, and Matei Zaharia. 2019. PipeDream: generalized pipeline parallelism for DNN training. In SOSP. ACM, 1–15

[40] [41]

Ashkan Nikravesh, Yihua Guo, Feng Qian, Zhuoqing Morley Mao, and Subhabrata Sen. 2016. An in-depth understanding of multipath TCP on mobile devices: measurement and system design. In MobiCom. ACM, 189–201

[41] [42]

OASIS. 2019. MQTT version 5.0. Retrieved June, 22, 2020, 1435

[42] [43]

OASIS. 2012. OASIS advanced message queuing protocol (AMQP) version 1.0. International Journal of Aerospace Engineering Hindawi www.hindawi.com, 2018

[43] [44]

OpenAI. 2023. GPT-4 technical report. (2023). arXiv: 2303.08774

[44] [45]

Jay H. Park, Gyeongchan Yun, Chang M. Yi, Nguyen T. Nguyen, Seungmin Lee, Jaesik Choi, Sam H. Noh, and Young-ri Choi. 2020. HetPipe: enabling large DNN training on (whimpy) heterogeneous GPU clusters through integration of pipelined model parallelism and data parallelism. In USENIX ATC. USENIX Association, 307–321

[45] [46]

David Patterson, Jeffrey M. Gilbert, Marco Gruteser, Efren Robles, Krishna Sekar, Yong Wei, and Tenghui Zhu. 2024. Energy and emissions of machine learning on smartphones vs. the cloud. Commun. ACM, 67, 2, 86–97

[46] [47]

Shixiong Qi, K. K. Ramakrishnan, and Myungjin Lee. 2024. LIFL: A lightweight, event-driven serverless platform for federated learning. In MLSys. mlsys.org

[47] [48]

Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. DeepSpeed: system optimizations enable training deep learning models with over 100 billion parameters. In KDD. ACM, 3505–3506

[48] [49]

R. Tyrrell Rockafellar, Stanislav Uryasev, et al. 2000. Optimization of conditional value-at-risk. Journal of Risk, 2, 21–42

[49] [50]

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In CVPR. IEEE, 10674–10685

[50] [51]

D. J. Russo, B. Van Roy, A. Kazerouni, I. Osband, Z. Wen, et al. 2018. A tutorial on Thompson sampling. Foundations and Trends® in Machine Learning, 11, 1, 1–96. (2018)

[51] [52]

Max Ryabinin, Tim Dettmers, Michael Diskin, and Alexander Borzunov. 2023. SWARM parallelism: training large models can be surprisingly communication-efficient. In ICML (Proceedings of Machine Learning Research). Vol. 202. PMLR, 29416–29440

[52] [53]

Max Ryabinin, Eduard Gorbunov, Vsevolod Plokhotnyuk, and Gennady Pekhimenko. 2021. Moshpit SGD: communication-efficient decentralized training on heterogeneous unreliable devices. In NeurIPS, 18195–18211

[53] [54]

Max Ryabinin and Anton Gusev. 2020. Towards crowdsourced training of large neural networks using decentralized mixture-of-experts. In NeurIPS

[54] [55]

Lorenzo Sani et al. 2025. Photon: federated LLM pre-training. In MLSys. mlsys.org

[55] [56]

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: training multi-billion parameter language models using model parallelism. (2019). arXiv: 1909.08053

[56] [57]

Craig S. Smith. 2023. What large models cost you – there is no free AI lunch. https://www.forbes.com/sites/craigsmith/2023/09/08/what-large-models-cost-you--there-is-no-free-ai-lunch/?sh=2b6d10724af7. (Sept. 2023)

[57] [58]

SPEEDTEST. 2025. Speed test global index. https://www.speedtest.net/global-index/united-states. Accessed: 2025-01-27. (2025)

[58] [59]

Zhenheng Tang et al. 2023. FusionAI: decentralized training and deploying LLMs with massive consumer-level GPUs. (2023). arXiv: 2309.01172

[59] [60]

Rajeev Thakur, Rolf Rabenseifner, and William Gropp. 2005. Optimization of collective communication operations in MPICH. Int. J. High Perform. Comput. Appl., 19, 1, 49–66

[60] [61]

Chandra Thapa, Mahawaga Arachchige Pathum Chamikara, Seyit Camtepe, and Lichao Sun. 2022. SplitFed: when federated learning meets split learning. In AAAI. AAAI Press, 8485–8493.

[61] [62]

John Thorpe, Pengzhan Zhao, Jonathan Eyolfson, Yifan Qiao, Zhihao Jia, Minjia Zhang, Ravi Netravali, and Guoqing Harry Xu. 2023. Bamboo: making preemptible instances resilient for affordable training of large DNNs. In NSDI. USENIX Association, 497–513

[62] [63]

Hugo Touvron, Louis Martin, Kevin Stone, et al. 2023. Llama 2: open foundation and fine-tuned chat models. (2023). arXiv: 2307.09288

[63] [64]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS, 5998–6008

[64] [65]

Marcel Wagenländer, Guo Li, Bo Zhao, Luo Mai, and Peter R. Pietzuch. 2024. Tenplex: dynamic parallelism for deep learning using parallelizable tensor collections. In SOSP. ACM, 195–210

[65] [66]

Duo Wu, Xianda Wang, Yaqi Qiao, Zhi Wang, Junchen Jiang, Shuguang Cui, and Fangxin Wang. 2024. NetLLM: adapting large language models for networking. In SIGCOMM. ACM, 661–678

[66] [67]

Tairan Xu, Leyang Xue, Zhan Lu, Adrian Jackson, and Luo Mai. 2025. MoE-Gen: high-throughput MoE inference on a single gpu with module-based batching. (2025). arXiv: 2503.09716

[67] [68]

Leyang Xue et al. 2025. Towards decentralized and sustainable foundation model training with the edge. ACM SIGENERGY Energy Informatics Review, 5, 2, 1–9

[68] [69]

Shengyuan Ye, Liekang Zeng, Xiaowen Chu, Guoliang Xing, and Xu Chen. 2024. Asteroid: resource-efficient hybrid pipeline parallelism for collaborative DNN training on heterogeneous edge devices. In MobiCom. ACM, 312–326

[70] [71]

Binhang Yuan, Yongjun He, Jared Davis, Tianyi Zhang, Tri Dao, Beidi Chen, Percy Liang, Christopher Ré, and Ce Zhang. 2022. Decentralized training of foundation models in heterogeneous environments. In NeurIPS

[71] [72]

Haoran Zhang, Adney Cardoza, Peter Baile Chen, Sebastian Angel, and Vincent Liu. 2020. Fault-tolerant and transactional stateful serverless workflows. In OSDI. USENIX Association, 1187–1204

[72] [73]

Susan Zhang, Stephen Roller, Naman Goyal, et al. 2022. OPT: open pre-trained transformer language models. (2022). arXiv: 2205.01068

[73] [74]

Lianmin Zheng et al. 2022. Alpa: automating inter- and intra-operator parallelism for distributed deep learning. In OSDI. USENIX Association, 559–578.

A Communication Efficiency: Homogeneous

We analyze the per-device communication volume and derive conditions under which Cleave achieves superior communication efficiency compared to conventional paralle...
