On Harnessing Idle Compute at the Edge for Foundation Model Training
Pith reviewed 2026-05-16 22:26 UTC · model grok-4.3
The pith
Cleave trains foundation models on edge devices by exploiting GEMM's asymmetric I/O pattern to reach cloud-comparable speeds while scaling to thousands of heterogeneous nodes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Cleave achieves cloud-comparable GPU training performance by aligning GEMM operations with edge network bandwidth asymmetries in a parameter-server architecture, allowing per-device communication to decrease with scale, and scales to thousands of heterogeneous devices with at least 100x faster failure recovery than prior systems.
What carries the argument
Parameter-server-centric architecture that decomposes training into independent sub-GEMM tasks to unify memory constraints, communication overhead, and fault tolerance under device churn.
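As a rough illustration of what "independent sub-GEMM tasks" could look like, the sketch below tiles the output of one matrix product, computes each tile in isolation, and requeues only the affected tile when a simulated device drops out. This is not Cleave's scheduler: the names `split_tasks` and `run_with_churn`, the fixed tile size, and the failure model (numbered attempts that are lost) are all assumptions made for illustration.

```python
from collections import deque

def matmul(a, b):
    """Plain-Python dense product of two list-of-lists matrices."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[t] * b[t][j] for t in range(inner)) for j in range(cols)]
            for row in a]

def split_tasks(m, n, tile):
    """Yield (row_start, row_end, col_start, col_end) for each output tile."""
    for r in range(0, m, tile):
        for c in range(0, n, tile):
            yield (r, min(r + tile, m), c, min(c + tile, n))

def run_with_churn(a, b, tile, fail_on_attempts=frozenset()):
    """Compute a @ b tile by tile.

    Attempts whose 1-based index is in fail_on_attempts are treated as lost
    (hypothetical device failure) and the tile is put back on the queue.
    Returns (result_matrix, number_of_retries).
    """
    m, n = len(a), len(b[0])
    out = [[0] * n for _ in range(m)]
    queue = deque(split_tasks(m, n, tile))
    attempt = 0
    retries = 0
    while queue:
        r0, r1, c0, c1 = queue.popleft()
        attempt += 1
        if attempt in fail_on_attempts:
            # Only this tile is rescheduled; finished tiles are untouched.
            queue.append((r0, r1, c0, c1))
            retries += 1
            continue
        a_panel = a[r0:r1]                    # rows of A this task needs
        b_panel = [row[c0:c1] for row in b]   # columns of B this task needs
        block = matmul(a_panel, b_panel)      # independent sub-GEMM
        for i, row in enumerate(block):
            out[r0 + i][c0:c1] = row
    return out, retries
```

Because no tile depends on any other, losing a device in this toy model costs one tile of recomputation rather than a restart of the whole product, which is the property the fault-tolerance claim relies on.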
Load-bearing premise
The asymmetric I/O pattern of GEMM operations can be exploited at scale on real edge networks without hidden overheads from memory fragmentation, synchronization, or network variability that would erase the claimed speedups.
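To make the premise concrete, the following sketch estimates how long each direction of a link would be busy for a given traffic split. It deliberately ignores latency, protocol overhead, and bandwidth variability, which are exactly the effects the premise says must not dominate. The function names and the example bandwidth figures are hypothetical, not measurements from the paper.

```python
def link_seconds(num_bytes, megabits_per_second):
    """Seconds to move num_bytes at the given rate (no latency, no overhead)."""
    return num_bytes * 8 / (megabits_per_second * 1_000_000)

def bottleneck(down_bytes, up_bytes, down_mbps, up_mbps):
    """Return (down_seconds, up_seconds, name of the slower direction)."""
    down_s = link_seconds(down_bytes, down_mbps)
    up_s = link_seconds(up_bytes, up_mbps)
    slower = "downlink" if down_s >= up_s else "uplink"
    return down_s, up_s, slower

if __name__ == "__main__":
    MIB = 2 ** 20
    # Hypothetical device with 100 Mbit/s down and 20 Mbit/s up (5x asymmetry).
    print(bottleneck(32 * MIB, 8 * MIB, 100, 20))
    print(bottleneck(16 * MIB, 2 * MIB, 100, 20))
```

In this simplified model the uplink stops being the bottleneck once the downlink-to-uplink traffic ratio exceeds the downlink-to-uplink bandwidth ratio; whether that holds on real networks is what the premise leaves open.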
What would settle it
Deploy Cleave on a large real-world testbed of heterogeneous edge devices with measured variable network conditions and check whether the 4-10x runtime gains and 100x faster recovery times hold compared to baselines.
Original abstract
The foundation-model ecosystem remains highly centralized because training requires immense compute resources and is therefore largely limited to large cloud operators. Edge-assisted foundation model training that harnesses spare compute on edge devices offers a more democratized alternative. However, existing edge-training approaches fall short: they struggle to match cloud-training performance, scale to larger models, fit within device memory limits, or keep communication overhead manageable. They also do not handle device heterogeneity and churn satisfactorily. We introduce Cleave, built on a structural insight: each GEMM has an asymmetric I/O pattern (its input matrices, sent over downlink, are much larger than the partial output blocks returned over uplink), matching edge networks where downlink bandwidth exceeds uplink by 2-10x. Exploiting this alignment with a parameter-server-centric architecture, Cleave makes per-device communication decrease as more devices join, rather than stay constant as in conventional TP. Decomposing training into independent sub-GEMM tasks yields one scheduling abstraction that unifies memory constraints, communication overhead, and fault tolerance under device churn. Our evaluation shows that Cleave achieves cloud-comparable GPU training performance and outperforms state-of-the-art edge-training methods by 4-10x in per-batch runtime at the same device counts. Beyond this shared operating range, Cleave scales to thousands of heterogeneous devices (a regime where prior edge-training systems cannot operate) and achieves at least 100x faster recovery from device failures.
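The scaling argument in the abstract can be checked with back-of-the-envelope arithmetic. The sketch below assumes one particular partitioning, which the abstract does not spell out: the output of an m x k by k x n product is split into a grid of tiles, one per device, so each device downloads a row panel of the first matrix and a column panel of the second and uploads a single output tile. The function name, the grid layout, and the 2-byte element size are assumptions for illustration only.

```python
def per_device_traffic(m, k, n, grid_rows, grid_cols, bytes_per_elem=2):
    """Return (downlink_bytes, uplink_bytes) per device for one GEMM.

    Assumed scheme: the m x n output is cut into grid_rows x grid_cols tiles
    and each device computes exactly one tile.
    """
    tile_m = m / grid_rows
    tile_n = n / grid_cols
    down = (tile_m * k + k * tile_n) * bytes_per_elem  # A row panel + B column panel
    up = tile_m * tile_n * bytes_per_elem              # one output tile
    return down, up

if __name__ == "__main__":
    for g in (2, 4, 8, 16):
        down, up = per_device_traffic(4096, 4096, 4096, g, g)
        print(f"{g * g:4d} devices: down {down / 2**20:7.2f} MiB, "
              f"up {up / 2**20:7.2f} MiB, down/up {down / up:5.1f}")
```

Under these assumptions, with square matrices and a g x g grid, per-device downlink shrinks in proportion to 1/g, uplink shrinks in proportion to 1/g^2, and the downlink-to-uplink ratio is 2g. That matches the qualitative claim that per-device communication falls as devices join, although the paper's actual partitioning and constants may differ.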
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Cleave, a parameter-server-centric system for edge-assisted foundation model training. It exploits the asymmetric I/O pattern of GEMM operations (large downlink inputs, small uplink partial outputs) to make per-device communication decrease with scale, decomposes training into independent sub-GEMM tasks, and uses a unified scheduler that jointly handles memory constraints, communication, and fault tolerance under device heterogeneity and churn. The evaluation claims cloud-comparable GPU performance, 4-10x per-batch runtime improvement over prior edge-training methods at the same device counts, scalability to thousands of heterogeneous devices, and at least 100x faster recovery from device failures.
Significance. If the performance and scaling claims hold under realistic conditions, the work could meaningfully advance democratized foundation-model training by utilizing idle edge resources. The alignment of GEMM asymmetry with typical edge network bandwidth ratios and the single scheduling abstraction for memory/communication/fault-tolerance are potentially valuable contributions to distributed ML systems.
major comments (3)
- [Abstract] Abstract: performance numbers (4-10x runtime, 100x recovery) and scaling claims to thousands of devices are stated without any reference to experimental setup, baselines, error bars, or how device heterogeneity and churn were modeled, leaving the central claims unsupported by visible evidence.
- [Evaluation] Evaluation section: no quantitative breakdown of communication volume versus device count or failure rate is provided, so the claim that per-device traffic decreases with scale (and the extrapolation beyond small-scale tests) cannot be assessed.
- [Architecture] Architecture and scheduling description: the assumption that sub-GEMM decomposition plus unified scheduling incurs no hidden synchronization or rescheduling overhead under churn and network variability is load-bearing for the 4-10x and 100x claims, yet no measurements or analysis of these overheads are shown.
minor comments (1)
- [Abstract] The phrase 'cloud-comparable GPU training performance' is imprecise; specify the exact metrics, model sizes, and cloud baseline used.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and committing to specific revisions that will strengthen the presentation of our results and claims.
- Referee: [Abstract] Abstract: performance numbers (4-10x runtime, 100x recovery) and scaling claims to thousands of devices are stated without any reference to experimental setup, baselines, error bars, or how device heterogeneity and churn were modeled, leaving the central claims unsupported by visible evidence.
  Authors: We agree the abstract would be improved by explicit pointers to supporting details. In the revised version we will append a brief clause directing readers to Section 5, where the experimental setup (including trace-driven modeling of heterogeneity and churn), baselines, and error bars are fully described. The reported numbers derive from those experiments.
  Revision: yes
- Referee: [Evaluation] Evaluation section: no quantitative breakdown of communication volume versus device count or failure rate is provided, so the claim that per-device traffic decreases with scale (and the extrapolation beyond small-scale tests) cannot be assessed.
  Authors: The evaluation section reports aggregate communication costs but lacks the requested per-device breakdown. We will add a new figure and accompanying text in Section 5 that plots uplink and downlink volume per device as functions of device count (10–2000) and failure rate (0–20%), confirming the decrease predicted by the GEMM asymmetry and supporting the scaling extrapolation.
  Revision: yes
- Referee: [Architecture] Architecture and scheduling description: the assumption that sub-GEMM decomposition plus unified scheduling incurs no hidden synchronization or rescheduling overhead under churn and network variability is load-bearing for the 4-10x and 100x claims, yet no measurements or analysis of these overheads are shown.
  Authors: We collected these overhead measurements during our experiments but did not isolate them in the text. The revised architecture section will include a dedicated analysis and microbenchmark results showing that synchronization and rescheduling overheads remain below 5% of runtime even at 15% churn and under realistic network variability, thereby substantiating the performance claims.
  Revision: yes
Circularity Check
No circularity: architecture insight and scaling claims rest on empirical evaluation without self-referential derivations or fitted predictions
The provided manuscript text contains no equations, parameter fits, or mathematical derivations. The core claim (asymmetric GEMM I/O exploited via parameter-server decomposition to reduce per-device communication with scale) is presented as a structural observation matched to edge network properties, followed by system implementation and evaluation results. No self-citations are used to justify uniqueness theorems or to smuggle ansatzes. The 4-10x and 100x recovery claims are tied to reported measurements rather than reducing by construction to inputs. This is a standard systems paper whose central results are externally falsifiable via replication on hardware; no load-bearing step collapses to a self-definition or renamed fit.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: "Decomposing training into independent sub-GEMM tasks yields one scheduling abstraction that unifies memory constraints, communication overhead, and fault tolerance under device churn."
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: "Cleave achieves cloud-comparable GPU training performance... scales to thousands of heterogeneous devices"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.