Unextractable Protocol Models: Collaborative Training and Inference without Weight Materialization

Alexander Long; Chamin Hewa Koneputugodage; Gil Avraham; Hadi Mohaghegh Dolatabadi; Sameera Ramasinghe; Thalaiyasingam Ajanthan; Violetta Shevchenko; Yan Zuo

arxiv: 2605.23464 · v1 · pith:5SXFVXRKnew · submitted 2026-05-22 · 💻 cs.LG

Pith reviewed 2026-05-25 04:34 UTC · model grok-4.3

classification 💻 cs.LG

keywords: unextractable protocol models · decentralized training · model sharding · collaborative inference · invertible transforms · weight materialization · extraction prevention

The pith

Unextractable Protocol Models keep collaborative neural network shards incoherent across time steps while preserving overall function.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Unextractable Protocol Models to support decentralized training and inference where no participant ever holds a complete set of weights. Shards held by different participants receive periodic injections of time-varying random invertible transforms at their boundaries. These transforms leave the end-to-end network unchanged yet ensure that shards assembled from different moments cannot form a coherent model. Experiments with two small language models show that ten thousand such transforms produce negligible shifts in perplexity and output distributions. The protocol adds limited overhead during training and inference while making extraction via direct access or gradient-based stitching require resources comparable to training from scratch.

Core claim

UPMs periodically inject time-varying, random, invertible transforms at participant boundaries, preserving the overall network function yet rendering cross-time assemblies incoherent.

What carries the argument

Time-varying random invertible transforms applied at participant boundaries that preserve the network function but render cross-time shard assemblies incoherent.
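
The mechanism described above can be illustrated with a toy boundary between two shards. This is a minimal sketch, not the paper's implementation: it assumes the boundary is purely linear (two weight matrices with nothing between them), uses orthogonal transforms as one convenient invertible family, and the names `random_orthogonal` and `morph` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 8, 16, 4

def random_orthogonal(n, rng):
    """Draw a random orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    # Fix column signs so the draw does not depend on the QR sign convention.
    return q * np.sign(np.diag(r))

# Two shards held by different participants, joined by a linear boundary
# (assumption: no nonlinearity sits between them).
W1 = rng.standard_normal((d_hidden, d_in))   # upstream shard
W2 = rng.standard_normal((d_out, d_hidden))  # downstream shard

def morph(W1, W2, rng):
    """Fold a fresh transform Q into both sides of the boundary."""
    Q = random_orthogonal(W1.shape[0], rng)
    return Q @ W1, W2 @ Q.T  # Q.T is the inverse of an orthogonal Q

x = rng.standard_normal(d_in)
y_ref = W2 @ (W1 @ x)  # output of the untransformed network

W1_t1, W2_t1 = morph(W1, W2, rng)        # shards at time step 1
W1_t2, W2_t2 = morph(W1_t1, W2_t1, rng)  # shards at time step 2

# Shards from the same time step still compute the original function.
same_time_err = np.max(np.abs(W2_t2 @ (W1_t2 @ x) - y_ref))
# Shards from different time steps no longer fit together.
cross_time_err = np.max(np.abs(W2_t2 @ (W1_t1 @ x) - y_ref))

print(f"same-time error:  {same_time_err:.2e}")
print(f"cross-time error: {cross_time_err:.2e}")
```

In this toy setting the same-time error sits at floating-point noise, while the cross-time assembly is off by a random rotation of the hidden representation.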

If this is right

Collaborative training and serving of models becomes feasible without any participant materializing the full weights.
Gradient-based fine-tuning of stitched partitions requires at least 60 percent of the tokens needed to train from scratch.
Direct attacks become impractical and straightforward to defend against under the protocol rules.
Programmatic incentive mechanisms can be embedded directly into community-driven decentralized training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could be evaluated on models larger than 1B parameters to check whether overhead and coherence properties scale linearly.
Similar boundary transforms might be adapted to other distributed systems that require partial views to remain inconsistent over time.
If the protocol is followed, it could support verifiable claims about model ownership in multi-party environments.

Load-bearing premise

All participants apply the transforms correctly and keep them secret while the sharded setup blocks any single party from obtaining full weights.

What would settle it

A successful extraction in which colluding participants share transform data to produce a coherent model or a low-cost attack that recovers usable weights despite the protocol.
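
The collusion scenario can be made concrete with a toy linear boundary. The sketch below is an illustration under strong assumptions (a single linear boundary, orthogonal transforms, and an attacker who somehow learns both transforms); the variable names are hypothetical and it says nothing about how hard such leakage is in the actual protocol.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 12

def random_orthogonal(n, rng):
    """Draw a random orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

W1 = rng.standard_normal((d, d))  # original upstream weights
W2 = rng.standard_normal((d, d))  # original downstream weights
x = rng.standard_normal(d)
y_ref = W2 @ (W1 @ x)

# Hypothetical protocol state: one boundary transform per time step.
Q_a = random_orthogonal(d, rng)  # transform in force at time a
Q_b = random_orthogonal(d, rng)  # transform in force at time b

upstream_a = Q_a @ W1        # attacker's copy of the upstream shard (time a)
downstream_b = W2 @ Q_b.T    # attacker's copy of the downstream shard (time b)

# Naive stitching of shards from different times is incoherent.
naive = downstream_b @ (upstream_a @ x)

# If both transforms leak, a correction C = Q_b Q_a^T cancels the mismatch.
C = Q_b @ Q_a.T
repaired = downstream_b @ (C @ (upstream_a @ x))

naive_err = np.max(np.abs(naive - y_ref))
repaired_err = np.max(np.abs(repaired - y_ref))
print(f"naive stitch error:    {naive_err:.2e}")
print(f"repaired stitch error: {repaired_err:.2e}")
```

This is why the secrecy of the transforms is load-bearing: once they are shared, the cross-time mismatch is a known linear map that can be undone exactly.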

Figures

Figures reproduced from arXiv: 2605.23464 by Alexander Long, Chamin Hewa Koneputugodage, Gil Avraham, Hadi Mohaghegh Dolatabadi, Sameera Ramasinghe, Thalaiyasingam Ajanthan, Violetta Shevchenko, Yan Zuo.

**Figure 1.** A pipeline parallel setup with 𝑅 pipelines and 𝑆 stages, where each stage is occupied by one of 𝐶 compute nodes. Participants can operate multiple nodes within a stage (color indicates participant identity). Nodes in red indicate an attacker, who hides their identity to try and get access to different stages of the model over time.

**Figure 2.** Our framework ensures stage weights vary over time by folding in transforms, yet the end-to-end …

**Figure 3.** Functional Equivalence during Inference. We evaluate the cumulative effect of transforms by measuring logit drift (via Jensen–Shannon distance) and perplexity increase on the WikiText test split across different numbers of orthogonal morphing steps. Results are shown for Qwen 2.5 0.5B and Llama 3.2 1B across multiple precisions. Solid lines denote our high-precision (HP) workaround described in Section 3.4…

**Figure 4.** UPM training. Training on 1.9B FineWeb tokens with Muon across three GPT-2 scales, comparing our UPM framework (with various transforms, morphing occurs every 0.5M tokens) to a no-transform baseline. Top: training loss. Bottom: relative loss gap vs. the baseline. As per the analysis in Section 3.3, orthogonal transforms are behaviorally indistinguishable from the baseline (overlapping curves, constant and …

**Figure 5.** Learning-based Attacks. Left: Tokens required, as a percentage of training from scratch to a perplexity of 20, to reach various perplexities on FineWeb with a stitched Llama 3.2 1B model. We evaluate this with two sets of pretrained model weights that are morphed and then stitched: the base Llama model weights released by Meta (which is trained on a wide variety of data) and our model weights that we train…

**Figure 6.** Tokens required, as a percentage of compute optimal (20B tokens), to reach various perplexities on …

Original abstract

We consider a decentralized setup in which the participants collaboratively train and serve a large neural network, and where each participant only processes a subset of the model. In this setup, we explore the possibility of unmaterializable weights, where a full weight set is never available to any one participant. We introduce Unextractable Protocol Models (UPMs): a training and inference framework that leverages the sharded model setup to ensure model shards (i.e., subsets) held by participants are incompatible at different time steps. UPMs periodically inject time-varying, random, invertible transforms at participant boundaries; preserving the overall network function yet rendering cross-time assemblies incoherent. On Qwen-2.5-0.5B and Llama-3.2-1B, 10,000 transforms leave FP32 perplexity unchanged ($\Delta$PPL $< 0.01$; Jensen-Shannon drift $< 4 \times 10^{-5}$), and we show how to control growth for lower precision datatypes. Applying a transform every 30s adds 3% latency, 0.1% bandwidth, and 10% GPU-memory overhead at inference, while training overhead falls to 1.6% time and $< 1$% memory. We consider several attacks, showing that the requirements of direct attacks are impractical and easy to defend against, and that gradient-based fine-tuning of stitched partitions consumes $\geq 60$% of the tokens required to train from scratch. By enabling models to be collaboratively trained yet not extracted, UPMs make it practical to embed programmatic incentive mechanisms in community-driven decentralized training.
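
The two functional-equivalence metrics quoted in the abstract, perplexity change and Jensen-Shannon drift, can be computed from logits roughly as follows. This is a sketch under assumptions: the paper's exact evaluation code is not shown here, the base-2 logarithm and the averaging over token positions are choices made for illustration, and the tiny random perturbation merely stands in for the numerical effect of accumulated transforms.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def js_distance(p, q, eps=1e-12):
    """Jensen-Shannon distance: sqrt of the JS divergence (base 2, so in [0, 1])."""
    m = 0.5 * (p + q)

    def kl(a, b):
        # Terms with a == 0 contribute zero by convention.
        return np.sum(np.where(a > 0, a * np.log2((a + eps) / (b + eps)), 0.0), axis=-1)

    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return np.sqrt(np.maximum(jsd, 0.0))  # clamp tiny negative rounding errors

def perplexity(logits, targets):
    """exp of the mean negative log-likelihood of the target tokens."""
    logp = np.log(softmax(logits))
    nll = -logp[np.arange(len(targets)), targets]
    return float(np.exp(nll.mean()))

rng = np.random.default_rng(0)
n_tokens, vocab = 5, 10
logits_ref = rng.standard_normal((n_tokens, vocab))
# Stand-in for logits after many transforms: a tiny numerical perturbation.
logits_morphed = logits_ref + 1e-6 * rng.standard_normal((n_tokens, vocab))
targets = rng.integers(0, vocab, size=n_tokens)

drift = float(js_distance(softmax(logits_ref), softmax(logits_morphed)).mean())
delta_ppl = abs(perplexity(logits_morphed, targets) - perplexity(logits_ref, targets))
print(f"mean JS distance: {drift:.2e}")
print(f"|delta PPL|:      {delta_ppl:.2e}")
```

With base-2 logarithms the distance is 0 for identical distributions and 1 for distributions with disjoint support, which gives a scale for reading a drift on the order of $10^{-5}$.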

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Unextractable Protocol Models (UPMs) for decentralized collaborative training and inference of neural networks in a sharded setup. UPMs periodically apply time-varying random invertible transforms at participant boundaries. The transforms preserve the overall network function while ensuring that model shards held by participants are incompatible across different time steps, which prevents any participant from materializing or extracting a coherent full model. On Qwen-2.5-0.5B and Llama-3.2-1B, applying 10,000 such transforms leaves FP32 perplexity essentially unchanged (ΔPPL < 0.01, Jensen-Shannon drift < 4×10^{-5}), with reported overheads of 3% latency / 0.1% bandwidth / 10% GPU memory at inference and 1.6% time / < 1% memory during training. The paper analyzes attacks, claiming that the requirements of direct attacks are impractical and easy to defend against, and that gradient-based fine-tuning on stitched partitions requires ≥60% of the tokens needed to train from scratch.

Significance. If the unextractability property holds under the stated assumptions, the approach could enable community-driven decentralized training with embedded programmatic incentives by making full model extraction impractical. The empirical demonstration of negligible functional impact and modest overheads on small models is a concrete strength, as is the explicit quantification of attack costs relative to from-scratch training.

major comments (2)

[Attack analysis / security model] The central unextractability claim (abstract and attack analysis) rests on the assumption that every participant correctly applies the current random invertible transform and that these transforms remain secret without collusion or leakage. No mechanism (e.g., cryptographic commitments or verifiable computation) is described to enforce this; any violation allows cancellation of mismatches and coherent assembly, directly undermining the incoherence property across time steps.
[Empirical results] Empirical validation of performance preservation and attack resistance is reported only for 0.5B–1B parameter models. It is unclear whether the transform injection, precision control, and incoherence properties remain load-bearing without degradation or prohibitive overhead when scaling to the larger models that would most benefit from decentralized training.

minor comments (2)

[Abstract] The abstract states that growth can be controlled for lower-precision datatypes but provides no explicit method, equation, or quantitative results for this control.
[Methods] Notation for the transforms (e.g., how invertibility is ensured while maintaining exact functional equivalence) could be clarified with a short formal definition or pseudocode in the methods section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below, providing clarifications on the security model and scalability while noting where revisions will be made.

Point-by-point responses

Referee: [Attack analysis / security model] The central unextractability claim (abstract and attack analysis) rests on the assumption that every participant correctly applies the current random invertible transform and that these transforms remain secret without collusion or leakage. No mechanism (e.g., cryptographic commitments or verifiable computation) is described to enforce this; any violation allows cancellation of mismatches and coherent assembly, directly undermining the incoherence property across time steps.

Authors: The manuscript presents UPMs under a standard semi-honest model in which participants are assumed to execute the protocol correctly, including applying the prescribed time-varying transforms at each step. The attack analysis quantifies the cost of extraction attempts that operate within these protocol boundaries (e.g., cross-time stitching or fine-tuning of assembled shards). We agree that the paper does not provide cryptographic enforcement mechanisms such as commitments or verifiable computation; any such mechanism would be an orthogonal layer that could be composed with UPMs but would introduce additional overhead not analyzed here. We will revise the security-model subsection to explicitly articulate these trust assumptions and note the absence of enforcement primitives as a limitation for fully adversarial settings.
Revision: partial

Referee: [Empirical results] Empirical validation of performance preservation and attack resistance is reported only for 0.5B–1B parameter models. It is unclear whether the transform injection, precision control, and incoherence properties remain load-bearing without degradation or prohibitive overhead when scaling to the larger models that would most benefit from decentralized training.

Authors: The reported experiments use 0.5B and 1B models to isolate the effect of repeated invertible transforms under controlled precision. The core algebraic properties (invertibility, functional equivalence, and controlled drift) are independent of model scale. Overhead scales linearly with shard dimension and is already shown to be a small fraction of baseline cost; for larger models the relative communication and memory overhead is expected to decrease. We will add a dedicated paragraph in the discussion section that extrapolates these scaling trends, quantifies expected precision-control requirements for FP16/BF16 on larger models, and acknowledges that direct empirical confirmation on models beyond 1B remains future work.
Revision: partial

Circularity Check

0 steps flagged

No significant circularity: protocol defined from first principles with external empirical validation

Full rationale

The paper defines UPMs by explicitly constructing time-varying random invertible transforms applied at participant boundaries. Preservation of the overall network function follows directly from the invertibility property (a standard linear-algebra fact), while cross-time incoherence is the intended consequence of the time-varying choice; neither is derived from data or prior results. Perplexity and drift measurements on Qwen-2.5-0.5B and Llama-3.2-1B are reported as external benchmarks, not fitted parameters renamed as predictions. No equations, self-citations, or uniqueness theorems appear in the supplied text that would reduce the central claim to its own inputs by construction. Security assumptions about honest protocol adherence are stated as prerequisites rather than derived.
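
The step from invertibility to function preservation can be written out as a short derivation. The notation below is illustrative rather than the paper's own: it assumes an $S$-stage pipeline $f = f_S \circ \cdots \circ f_1$, one invertible transform $T_s^{(t)}$ per boundary and time step $t$, identity transforms at the network input and output, and that each transform can be folded into the adjacent stage.

```latex
% Sketch under stated assumptions; symbols are illustrative, not the paper's.
% f_s       : stage s of an S-stage pipeline, f = f_S \circ \cdots \circ f_1
% T_s^{(t)} : invertible transform at the boundary after stage s at time t,
%             with T_0^{(t)} = T_S^{(t)} = I
\begin{align*}
\tilde f_s^{(t)}
  &= T_s^{(t)} \circ f_s \circ \bigl(T_{s-1}^{(t)}\bigr)^{-1},
  \qquad s = 1, \dots, S, \\
\tilde f_S^{(t)} \circ \cdots \circ \tilde f_1^{(t)}
  &= T_S^{(t)} \circ f_S \circ \bigl(T_{S-1}^{(t)}\bigr)^{-1} \circ T_{S-1}^{(t)}
     \circ f_{S-1} \circ \cdots \circ f_1 \circ \bigl(T_0^{(t)}\bigr)^{-1} \\
  &= f_S \circ \cdots \circ f_1 = f, \\
\tilde f_{s+1}^{(t')} \circ \tilde f_s^{(t)}
  &= T_{s+1}^{(t')} \circ f_{s+1} \circ \bigl(T_s^{(t')}\bigr)^{-1} \circ T_s^{(t)}
     \circ f_s \circ \bigl(T_{s-1}^{(t)}\bigr)^{-1}.
\end{align*}
```

At a single time step every inner pair $\bigl(T_s^{(t)}\bigr)^{-1} \circ T_s^{(t)}$ cancels, so the composition telescopes to $f$. For $t' \neq t$ the factor $\bigl(T_s^{(t')}\bigr)^{-1} \circ T_s^{(t)}$ is in general not the identity, which is the cross-time incoherence the rationale refers to.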

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The approach depends on the assumption that invertible transforms preserve the network function and on empirical testing with specific models; it introduces no new physical entities, only a new protocol.

free parameters (1)

transform frequency = every 30 seconds
Selected to achieve a security-overhead tradeoff, as stated in the abstract.

axioms (1)

domain assumption: Random invertible transforms preserve the overall network function when applied at boundaries
Core to maintaining model performance while achieving incoherence.

invented entities (1)

Unextractable Protocol Models (UPMs) · no independent evidence
purpose: To enable collaborative training without weight materialization
Newly proposed framework in the paper.

pith-pipeline@v0.9.0 · 5869 in / 1395 out tokens · 28525 ms · 2026-05-25T04:34:20.387531+00:00 · methodology

[1] [1]

Foundations and Trends in Machine Learning , volume =

Advances and open problems in federated learning , author =. Foundations and Trends in Machine Learning , volume =. 2021 , publisher =

work page 2021

[2] [2]

Vogels, Thijs and Karimireddy, Sai Praneeth and Jaggi, Martin , booktitle =

work page

[3] [3]

A unified theory of decentralized

Koloskova, Anastasia and Loizou, Nicolas and Boreiri, Sadra and Jaggi, Martin and Stich, Sebastian , booktitle =. A unified theory of decentralized. 2020 , organization =

work page 2020

[4] [4]

Advances in Neural Information Processing Systems , volume =

Decentralized training of foundation models in heterogeneous environments , author =. Advances in Neural Information Processing Systems , volume =

work page

[5] [5]

Advances in Neural Information Processing Systems , volume =

Distributed deep learning in open collaborations , author =. Advances in Neural Information Processing Systems , volume =

work page

[6] [6]

Advances in Neural Information Processing Systems , volume =

Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent , author =. Advances in Neural Information Processing Systems , volume =

work page

[7] [7]

Advances in Neural Information Processing Systems , volume =

Attention is all you need , author =. Advances in Neural Information Processing Systems , volume =

work page

[8] [8]

Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing , pages =

Fully homomorphic encryption using ideal lattices , author =. Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing , pages =

work page

[9] [9]

2024 , eprint =

DiLoCo: Distributed low-communication training of language models , author =. 2024 , eprint =

work page 2024

[10] [10]

DeepSeek-V3 Technical Report

Liu, Aixin and Feng, Bei and Xue, Bing and Wang, Bingxuan and Wu, Bochao and Lu, Chengda and Zhao, Chenggang and Deng, Chengqi and Zhang, Chenyu and Ruan, Chong and others , year =. 2412.19437 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

2024 , eprint =

The Llama 3 herd of models , author =. 2024 , eprint =

work page 2024

[12] [12]

2024 , eprint =

The rising costs of training frontier AI models , author =. 2024 , eprint =

work page 2024

[13] [13]

2023 , eprint =

Federated foundation models: Privacy-preserving and collaborative learning for large models , author =. 2023 , eprint =

work page 2023

[14] [14]

2023 , eprint =

When foundation model meets federated learning: Motivations, challenges, and future directions , author =. 2023 , eprint =

work page 2023

[15] [15]

2019 IEEE VTS Asia Pacific Wireless Communications Symposium (APWCS) , pages =

Incentive design for efficient federated learning in mobile networks: A contract theory approach , author =. 2019 IEEE VTS Asia Pacific Wireless Communications Symposium (APWCS) , pages =. 2019 , organization =

work page 2019

[16] [16]

2021 , publisher =

Peng, Zhe and Xu, Jianliang and Chu, Xiaowen and Gao, Shang and Yao, Yuan and Gu, Rong and Tang, Yuzhe , journal =. 2021 , publisher =

work page 2021

[17] Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017.

[18] 2020.

[19] Max Ryabinin, Tim Dettmers, Michael Diskin, Alexander Borzunov.

[20] Petals: Collaborative Inference and Fine-tuning of Large Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), 2023.

[21] Towards crowdsourced training of large neural networks using decentralized mixture-of-experts. Advances in Neural Information Processing Systems.

[22] Max Ryabinin, Eduard Gorbunov, Vsevolod Plokhotnyuk, Gennady Pekhimenko. Moshpit

[23] Randomized gossip algorithms. IEEE Transactions on Information Theory, 2006.

[24] Gossip training for deep learning. 2016.

[25] Communication-efficient decentralized learning with sparsification and adaptive peer selection. In 2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS), 2020.

[26] Asynchronous decentralized parallel stochastic gradient descent. In Proceedings of the 35th International Conference on Machine Learning, 2018.

[27] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, et al.

[28] Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, Matei Zaharia. 2019.

[29] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv:1909.08053.

[30] The carbon footprint of machine learning training will plateau, then shrink. Computer, 2022.

[31] Scaling laws for neural language models. 2020.

[32] Deep learning's diminishing returns: The cost of improvement is becoming unsustainable. IEEE Spectrum, 2021.

[33] The de-democratization of AI: Deep learning and the compute divide in artificial intelligence research. 2020.

[34] A survey on federated learning systems: Vision, hype and reality for data privacy and protection. IEEE Transactions on Knowledge and Data Engineering, 2021.

[35] Stealing part of a production language model. In Proceedings of the 41st International Conference on Machine Learning.

[36] Stealing Machine Learning Models via Prediction APIs. In USENIX Security Symposium.

[37] Thieves on Sesame Street! Model Extraction of BERT-based APIs. In International Conference on Learning Representations.

[38] Imitation Attacks and Defenses for Black-box Machine Translation Systems. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing.

[39] How to Share a Secret. Communications of the ACM.

[40] Training Behavior of Deep Neural Network in Frequency Domain. In Neural Information Processing: 26th International Conference, ICONIP 2019, Sydney, NSW, Australia, December 12-15, 2019, Proceedings, Part I, 2019.

[41] Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks. Communications in Computational Physics, 2020.

[42] On the Spectral Bias of Neural Networks. In Proceedings of the 36th International Conference on Machine Learning, 2019.

[43] Muon: An optimizer for hidden layers in neural networks. 2024.

[44] Pointer sentinel mixture models. In International Conference on Learning Representations.

[45] Qwen2.5 Technical Report. 2025.

[46] modded-nanogpt: Speedrunning the NanoGPT baseline. 2024.

[47] NanoGPT Speedrun Worklog. 2024.

[48] Root Mean Square Layer Normalization. Advances in Neural Information Processing Systems, 2019.

[49] Penedo, Guilherme and Kydl. The. Advances in Neural Information Processing Systems, 2024.

[50] How to generate random matrices from the classical compact groups. 2007.

[51] Matrix Analysis. 2012.

[52] Accuracy and Stability of Numerical Algorithms. 2002.

[53] PyTorch: An Imperative Style, High-Performance Deep Learning Library. Advances in Neural Information Processing Systems, 2019.

[54] torchtune: PyTorch's finetuning library. April 2024.

[55] Wanchao Liang, Tianyu Liu, Less Wright, Will Constable, Andrew Gu, Chien-Chin Huang, Iris Zhang, Wei Feng, Howard Huang, Junjie Wang, Sanket Purandare, Gokul Nadathur, Stratos Idreos. TorchTitan: One-stop PyTorch native solution for production ready. 2025.

[56] nanoGPT: The simplest, fastest repository for training/finetuning medium-sized GPTs. 2023.

[57] Protocol Models: Scaling Decentralized Training with Communication-Efficient Model Parallelism. 2025.

[58] Nesterov Method for Asynchronous Pipeline Parallel Optimization. In Proceedings of the 42nd International Conference on Machine Learning.

[59] Node0: Model Parallel Training over the Internet with Protocol Models. 2025.

[60] A Single Merging Suffices: Recovering Server-based Learning Performance in Decentralized Learning. 2025.

[61] Profiting from Technological Innovation: Implications for Integration, Collaboration, Licensing and Public Policy. Research Policy, 1986.

[62] Economic Welfare and the Allocation of Resources for Invention. In The Rate and Direction of Inventive Activity: Economic and Social Factors, 1962. doi:10.1515/9781400879762-024

[63] How to Capture Value from Innovation: Shaping Intellectual Property and Industry Architecture. California Management Review, 2007.

[64] Understanding Institutional Diversity. 2005.

[65] The Pure Theory of Public Expenditure. The Review of Economics and Statistics, 1954.

[66] The Continuum of Excludability and the Limits of Patents. The Yale Law Journal, 2013.