pith. sign in

arxiv: 2606.19025 · v2 · pith:6NNQ2QMLnew · submitted 2026-06-17 · 💻 cs.LG · cs.AI· cs.DC· cs.SY· eess.SY

FoMoE: Breaking the Full-Replica Barrier with a Federation of MoEs

Pith reviewed 2026-06-26 21:23 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.DCcs.SYeess.SY
keywords Mixture of Expertsdistributed trainingcommunication efficiencypartial replicationskip token mechanismlarge language modelsfederated compute
0
0 comments X

The pith

FoMoE trains large MoEs across ordinary links by partitioning experts and skipping non-resident ones instead of full replicas.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FoMoE as a way to train Mixture-of-Experts models without placing a complete copy of the model on every worker. It partitions the expert layers so each worker holds only some experts and simply drops any token that would need an expert held elsewhere. This change cuts the volume of data that must cross slow network links while still allowing the model to learn. A reader would care if the approach holds because it removes the need for expensive high-speed datacenter networks when pooling compute from many separate machines.

Core claim

FoMoE breaks the full-replica paradigm by partitioning expert layers across workers and skipping non-resident experts during local training. In controlled regimes this yields communication reductions of up to 1.42 times versus efficient baselines and 45.44 times versus Distributed Data Parallelism, empirical throughput gains of up to 1.4 times from the skip-token step, and stable routing behavior, with system modeling indicating the same pattern extends to 100-billion-parameter scales.

What carries the argument

The skip-token mechanism that drops tokens requiring experts not present on the local worker during each training step, paired with partial expert replication across the federation of sites.

If this is right

  • Communication volume falls by up to 1.42 times relative to prior low-communication baselines and by 45.44 times relative to standard data-parallel training.
  • Throughput rises by as much as 1.4 times because the local worker never waits for missing experts.
  • Routing decisions remain stable once training completes in the tested settings.
  • The same partial-replication pattern projects to 100-billion-parameter models with continued memory and bandwidth savings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Geographically scattered machines could now contribute to a single large MoE run without a dedicated high-speed fabric connecting them.
  • The same token-skipping idea might transfer to other sparse activation patterns that currently assume every parameter is reachable everywhere.
  • If skipping proves neutral, it questions whether every expert must be resident at every site for the routing network to learn correctly.

Load-bearing premise

Skipping tokens that need non-resident experts leaves model quality and expert routing unbiased enough for convergence to continue normally.

What would settle it

A side-by-side training run in which final perplexity or downstream accuracy falls measurably when the skip-token rule is active versus an otherwise identical full-replica baseline.

Figures

Figures reproduced from arXiv: 2606.19025 by Alex Iacob, Andrej Jovanovic, Lorenzo Sani, Meghdad Kurmanji, Nicholas D. Lane, Wanru Zhao, Yan Gao, Zeyu Cao.

Figure 1
Figure 1. Figure 1: FoMoE breaking the “full-replica” barrier for WAN￾connected MoE training. Methods such as DiLoCo and Photon maintain a full MoE replica at each site and communicate this object at round boundaries. FoMoE instead leverages MoE sparsity so each worker trains and communicates only the experts assigned to it. This design targets cross-site WAN settings where activation dispatch is too expensive. Bandwidth: ful… view at source ↗
Figure 2
Figure 2. Figure 2: Parameter distribution trends. Scaling width (dmodel) dramatically increases the proportion of sparse parameters (θsparse), motivating the focus of our strategy. 3.3 Partitioning and placement strategies FoMoE enables the distribution of sparse parameters con￾trolled via two coupled design decisions: partitioning, which determines expert replication and balances memory and bandwidth; and placement, which m… view at source ↗
Figure 3
Figure 3. Figure 3: Comparison between a transformer MoE block as a full versus partial replica. In the partial-replica form, the same experts can be selected per token as in the full-replica case, but missing experts are ignored by the local computational graph, yielding savings in FLOP, communication, and memory. ing model states and may affect local optimizer momentum (which we discuss in Sec. 3.4). In practice, the reshuf… view at source ↗
Figure 4
Figure 4. Figure 4: Expert placement and ring synchronization. In a full￾replica regime, each available expert incurs its own full ring across workers. As Oe → 1, the communication cost R(Oe) decreases because each expert ring spans fewer workers; at Oe = 1, each ring is localized to the worker that owns the expert. 3.4 Synchronization and model optimization FoMoE training (as we show in Alg. 2) proceeds in rounds (McMahan et… view at source ↗
Figure 5
Figure 5. Figure 5: Pareto frontier of communication costs versus perplexity for full-replica baselines across outer optimizers. “Centralized” is the high-bandwidth single-site quality reference. LocalAdamW offers a better trade-off than DiLoCo within a given commu￾nication budget, motivating the use of the stateful optimizer in subsequent FoMoE configurations. cause synchronizing optimizer state improves stability in low-fre… view at source ↗
Figure 7
Figure 7. Figure 7: Impact of “skip-token” and “ghost experts” on the perplexity versus FLOPs Pareto frontier. Across outer optimizer configurations, these mechanisms shift the empirical compute￾quality frontier toward lower FLOP cost at comparable perplexity, and lower perplexity at comparable FLOP cost, relative to full￾replica baselines. FLOPs Scalability [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗
Figure 6
Figure 6. Figure 6: Empirical Pareto frontiers comparing communication cost (top) and computation cost (bottom) against perplexity for FoMoE across different outer optimizers and placement strate￾gies. In this sweep, the strongest observed non-dominated con￾figurations occur with LocalAdamW and fixed expert placement. Momentum-vector averaging and expert reassignment affect the trade-off between regularization, data mixing, a… view at source ↗
Figure 8
Figure 8. Figure 8: Analytical/modeled total training FLOPs versus [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Analytical/modeled cumulative communication over [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Analytical/modeled total training time versus ex [PITH_FULL_IMAGE:figures/full_fig_p014_10.png] view at source ↗
Figure 12
Figure 12. Figure 12: FoMoE Layer Architecture. The diagram illustrates the Peripheral LayerNorm (Peri-LN) strategy, where normalization is applied exclusively to the branches feeding the Attention and MoE blocks (Pre-LN) and immediately following them (Post-LN). This design maintains a linear residual stream to preserve gradient magnitude. Additionally, the architecture includes a depth multi￾plier applied to the output of ea… view at source ↗
Figure 13
Figure 13. Figure 13: Learning rate tuning results for centralized (local step size of one) and federated baselines across varying local steps, architectures (dense vs. MoE), and aggregation methods. Across all configurations, a base learning rate of η ≈ 0.01 consistently minimizes perplexity. We adopt this optimal rate for subsequent model scaling experiments. D THEORETICAL FOUNDATIONS AND COST MODELING This appendix serves a… view at source ↗
read the original abstract

Pre-training Large Language Models (LLMs) typically demands large-scale infrastructure with tightly coupled hardware accelerators. Mixture-of-Experts (MoEs) architectures partially decouple model capacity from per-token compute. This efficiency alone does not make MoE training feasible over ordinary Internet links or loosely connected commodity hardware since active expert routing still assumes high-speed datacenter fabrics. Low-communication methods such as DiLoCo and Photon reduce synchronization frequency across distributed sites, mitigating bandwidth constraints, yet still require full model replicas at every site. This creates a mismatch: modern MoEs have sparse data paths, but their distributed training infrastructure remains communication-dense and memory-inefficient, limiting attempts to pool geographically distributed compute. In this work, we introduce FoMoE, a system that breaks the full-replica paradigm by partitioning expert layers across workers and skipping non-resident experts during local training. We demonstrate that FoMoE: (I) reduces communication costs by up to 1.42x over efficient baselines and 45.44x over Distributed Data Parallelism (DDP) via partial expert replication in controlled regimes; (II) achieves empirical throughput speedups of up to 1.4x through the skip-token mechanism; and (III) shows stable routing in the trained regimes and projects the communication/memory benefits to 100B-scale configurations through system modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper introduces FoMoE, a federated training approach for Mixture-of-Experts (MoE) models that partitions expert layers across workers and skips tokens requiring non-resident experts during local training. It claims communication cost reductions of up to 1.42x over efficient baselines and 45.44x over Distributed Data Parallelism via partial expert replication, empirical throughput speedups of up to 1.4x from the skip-token mechanism, stable routing in trained regimes, and projected benefits at 100B-scale via system modeling.

Significance. If the empirical measurements and the assumption that token skipping preserves model quality hold under the reported conditions, the work would enable MoE pre-training over lower-bandwidth links without requiring full model replicas at every site, addressing a key barrier for geographically distributed or commodity hardware setups.

minor comments (2)
  1. [Abstract] Abstract: the reported factors (1.42x communication reduction, 1.4x throughput) are presented without error bars, dataset names, model sizes, or hardware configurations, which limits independent assessment of the measurements.
  2. [Abstract] Abstract: no mention of ablation experiments or conditions under which the skip-token policy was validated, leaving the central assumption about preserved routing stability and convergence without explicit falsification tests.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their review of our manuscript. We are encouraged by the positive assessment of significance, conditional on the empirical results and token-skipping assumption holding. No specific major comments were enumerated in the provided report, so we have no point-by-point rebuttals to offer at this stage. We remain available to supply additional experiments, clarifications, or revisions should the referee raise concrete concerns.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents a systems contribution with empirical measurements of communication reduction (1.42x/45.44x), throughput speedup (1.4x), and routing stability under partial expert replication and token skipping. No equations, fitted parameters, derivation chains, or load-bearing self-citations appear in the abstract or described claims. All reported gains are direct experimental outcomes rather than quantities derived from other fitted quantities or prior self-citations within the paper. The system modeling projection for 100B-scale is noted but does not alter the empirical core.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a systems and empirical paper. The abstract introduces no mathematical derivations, fitted constants, background axioms, or new physical entities; the contribution is an engineering architecture and its measured performance.

pith-pipeline@v0.9.1-grok · 5811 in / 1156 out tokens · 19612 ms · 2026-06-26T21:23:16.270475+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

154 extracted references · 9 canonical work pages · 2 internal anchors

  1. [1]

    Deep Learning with Differential Privacy , booktitle =

    Mart. Deep Learning with Differential Privacy , booktitle =

  2. [2]

    Dan Alistarh and Demjan Grubic and Jerry Li and Ryota Tomioka and Milan Vojnovic , title =

  3. [3]

    EuroSys , pages =

    Sanjith Athlur and Nitika Saran and Muthian Sivathanu and Ramachandran Ramjee and Nipun Kwatra , title =. EuroSys , pages =

  4. [4]

    Bonawitz and Vladimir Ivanov and Ben Kreuter and Antonio Marcedone and H

    Kallista A. Bonawitz and Vladimir Ivanov and Ben Kreuter and Antonio Marcedone and H. Brendan McMahan and Sarvar Patel and Daniel Ramage and Aaron Segal and Karn Seth , title =

  5. [5]

    CoRR , volume =

    Pietro Cagnasso and Eugene Belilovsky and Edouard Oyallon , title =. CoRR , volume =. 2026 , doi =

  6. [6]

    The Thirty-ninth Annual Conference on Neural Information Processing Systems , year =

    Communication-Efficient Language Model Training Scales Reliably and Robustly: Scaling Laws for DiLoCo , author =. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year =

  7. [7]

    2016 , eprint =

    Training Deep Nets with Sublinear Memory Cost , author =. 2016 , eprint =

  8. [8]

    Tiancheng Chen and Ales Kubicek and Langwen Huang and Torsten Hoefler , title =

  9. [9]

    Ziheng Cheng and Margalit Glasgow , title =

  10. [10]

    Approximating Two-Layer Feedforward Networks for Efficient Transformers , booktitle =

    R. Approximating Two-Layer Feedforward Networks for Efficient Transformers , booktitle =

  11. [11]

    CoRR , volume =

    Nolan Dey and Bin Claire Zhang and Lorenzo Noci and Mufan Bill Li and Blake Bordelon and Shane Bergsma and Cengiz Pehlevan and Boris Hanin and Joel Hestness , title =. CoRR , volume =

  12. [12]

    Rusu and Rachita Chhaparia and Yani Donchev and Adhiguna Kuncoro and Marc'Aurelio Ranzato and Arthur Szlam and Jiajun Shen , title =

    Arthur Douillard and Qixuan Feng and Andrei A. Rusu and Rachita Chhaparia and Yani Donchev and Adhiguna Kuncoro and Marc'Aurelio Ranzato and Arthur Szlam and Jiajun Shen , title =. CoRR , volume =. 2023 , doi =

  13. [13]

    Rusu and Adhiguna Kuncoro and Yani Donchev and Rachita Chhaparia and Ionel Gog and Marc'Aurelio Ranzato and Jiajun Shen and Arthur Szlam , title =

    Arthur Douillard and Qixuan Feng and Andrei A. Rusu and Adhiguna Kuncoro and Yani Donchev and Rachita Chhaparia and Ionel Gog and Marc'Aurelio Ranzato and Jiajun Shen and Arthur Szlam , title =. CoRR , volume =

  14. [14]

    Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch , journal =

    Arthur Douillard and Yanislav Donchev and Keith Rush and Satyen Kale and Zachary Charles and Zachary Garrett and Gabriel Teston and Dave Lacey and Ross McIlroy and Jiajun Shen and Alexandre Ram. Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch , journal =

  15. [15]

    Dai and Simon Tong and Dmitry Lepikhin and Yuanzhong Xu and Maxim Krikun and Yanqi Zhou and Adams Wei Yu and Orhan Firat and Barret Zoph and Liam Fedus and Maarten P

    Nan Du and Yanping Huang and Andrew M. Dai and Simon Tong and Dmitry Lepikhin and Yuanzhong Xu and Maxim Krikun and Yanqi Zhou and Adams Wei Yu and Orhan Firat and Barret Zoph and Liam Fedus and Maarten P. Bosma and Zongwei Zhou and Tao Wang and Yu Emma Wang and Kellie Webster and Marie Pellat and Kevin Robinson and Kathleen S. Meier. GLaM: Efficient Scal...

  16. [16]

    William Fedus and Barret Zoph and Noam Shazeer , title =. J. Mach. Learn. Res. , volume =

  17. [17]

    MLSys , publisher =

    Trevor Gale and Deepak Narayanan and Cliff Young and Matei Zaharia , title =. MLSys , publisher =

  18. [18]

    2021 , doi =

    Guo, Binbin and Mei, Yuan and Xiao, Danyang and Wu, Weigang , booktitle =. 2021 , doi =

  19. [19]

    PPoPP , pages =

    Jiaao He and Jidong Zhai and Tiago Antunes and Haojie Wang and Fuwen Luo and Shangfeng Shi and Qin Li , title =. PPoPP , pages =

  20. [20]

    Rae and Oriol Vinyals and Laurent Sifre , title =

    Jordan Hoffmann and Sebastian Borgeaud and Arthur Mensch and Elena Buchatskaya and Trevor Cai and Eliza Rutherford and Diego de Las Casas and Lisa Anne Hendricks and Johannes Welbl and Aidan Clark and Tom Hennigan and Eric Noland and Katie Millican and George van den Driessche and Bogdan Damoc and Aurelia Guy and Simon Osindero and Karen Simonyan and Eric...

  21. [21]

    Le and Yonghui Wu and Zhifeng Chen , title =

    Yanping Huang and Youlong Cheng and Ankur Bapna and Orhan Firat and Dehao Chen and Mia Xu Chen and HyoukJoong Lee and Jiquan Ngiam and Quoc V. Le and Yonghui Wu and Zhifeng Chen , title =. NeurIPS , pages =

  22. [22]

    MLSys , publisher =

    Changho Hwang and Wei Cui and Yifan Xiong and Ziyue Yang and Ze Liu and Han Hu and Zilong Wang and Rafael Salas and Jithin Jose and Prabhat Ram and Joe Chau and Peng Cheng and Fan Yang and Mao Yang and Yongqiang Xiong , title =. MLSys , publisher =

  23. [23]

    Shen and Xinchi Qiu and Dongqi Cai and Yan Gao and Nicholas Donald Lane , title =

    Alex Iacob and Lorenzo Sani and Meghdad Kurmanji and William F. Shen and Xinchi Qiu and Dongqi Cai and Yan Gao and Nicholas Donald Lane , title =

  24. [24]

    CoRR , volume =

    Alex Iacob and Lorenzo Sani and Mher Safaryan and Paris Giampouras and Samuel Horv. CoRR , volume =

  25. [25]

    Jacobs and Michael I

    Robert A. Jacobs and Michael I. Jordan and Steven J. Nowlan and Geoffrey E. Hinton , title =. Neural Comput. , volume =

  26. [26]

    CoRR , volume =

    Sami Jaghouar and Jack Min Ong and Johannes Hagemann , title =. CoRR , volume =

  27. [27]

    Jordan and Robert A

    Michael I. Jordan and Robert A. Jacobs , title =. Neural Comput. , volume =

  28. [28]

    Brendan McMahan and Brendan Avent and Aur

    Peter Kairouz and H. Brendan McMahan and Brendan Avent and Aur. Advances and Open Problems in Federated Learning , journal =

  29. [29]

    Brown and Benjamin Chess and Rewon Child and Scott Gray and Alec Radford and Jeffrey Wu and Dario Amodei , title =

    Jared Kaplan and Sam McCandlish and Tom Henighan and Tom B. Brown and Benjamin Chess and Rewon Child and Scott Gray and Alec Radford and Jeffrey Wu and Dario Amodei , title =. CoRR , volume =

  30. [30]

    CoRR , volume =

    Ahmed Khaled and Satyen Kale and Arthur Douillard and Chi Jin and Rob Fergus and Manzil Zaheer , title =. CoRR , volume =

  31. [31]

    Scaling Laws for Fine-Grained Mixture of Experts , booktitle =

    Jan Ludziejewski and Jakub Krajewski and Kamil Adamczewski and Maciej Pi. Scaling Laws for Fine-Grained Mixture of Experts , booktitle =

  32. [32]

    Dmitry Lepikhin and HyoukJoong Lee and Yuanzhong Xu and Dehao Chen and Orhan Firat and Yanping Huang and Maxim Krikun and Noam Shazeer and Zhifeng Chen , title =

  33. [33]

    Mike Lewis and Shruti Bhosale and Tim Dettmers and Naman Goyal and Luke Zettlemoyer , title =

  34. [34]

    MLSys , publisher =

    Tian Li and Anit Kumar Sahu and Manzil Zaheer and Maziar Sanjabi and Ameet Talwalkar and Virginia Smith , title =. MLSys , publisher =

  35. [36]

    Hwijoon Lim and Juncheol Ye and Sangeetha Abdu Jyothi and Dongsu Han , title =

  36. [37]

    Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages =

    Xudong Lu and Qi Liu and Yuhui Xu and Aojun Zhou and Siyuan Huang and Bo Zhang and Junchi Yan and Hongsheng Li , title =. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages =. 2024 , doi =

  37. [38]

    DeepSeek-V2:

    DeepSeek. DeepSeek-V2:. CoRR , volume =

  38. [39]

    Ilya Loshchilov and Frank Hutter , title =

  39. [40]

    CoRR , volume =

    Jan Malasnicki and Kamil Ciebiera and Mateusz Borun and Maciej Pi. CoRR , volume =

  40. [41]

    Communication-Efficient Learning of Deep Networks from Decentralized Data , booktitle =

    Brendan McMahan and Eider Moore and Daniel Ramage and Seth Hampson and Blaise Ag. Communication-Efficient Learning of Deep Networks from Decentralized Data , booktitle =

  41. [43]

    Diamos and Erich Elsen and David Garc

    Paulius Micikevicius and Sharan Narang and Jonah Alben and Gregory F. Diamos and Erich Elsen and David Garc. Mixed Precision Training , booktitle =

  42. [44]

    Oberman and Mohammad Shoeybi and Michael Y

    Paulius Micikevicius and Dusan Stosic and Neil Burgess and Marius Cornea and Pradeep Dubey and Richard Grisenthwaite and Sangwon Ha and Alexander Heinecke and Patrick Judd and John Kamalu and Naveen Mellempudi and Stuart F. Oberman and Mohammad Shoeybi and Michael Y. Siu and Hao Wu , title =. CoRR , volume =

  43. [45]

    Devanur and Gregory R

    Deepak Narayanan and Aaron Harlap and Amar Phanishayee and Vivek Seshadri and Nikhil R. Devanur and Gregory R. Ganger and Phillip B. Gibbons and Matei Zaharia , title =

  44. [46]

    Pitch Patarasuk and Xin Yuan , title =. J. Parallel Distributed Comput. , volume =

  45. [47]

    CoRR , volume =

    Ji Qi and WenPeng Zhu and Li Li and Ming Wu and YingJun Wu and Wu He and Xun Gao and Jason Zeng and Michael Heinrich , title =. CoRR , volume =

  46. [48]

    International Conference on Computational Science , series =

    Rolf Rabenseifner , title =. International Conference on Computational Science , series =

  47. [49]

    Samyam Rajbhandari and Jeff Rasley and Olatunji Ruwase and Yuxiong He , title =

  48. [50]

    Samyam Rajbhandari and Conglong Li and Zhewei Yao and Minjia Zhang and Reza Yazdani Aminabadi and Ammar Ahmad Awan and Jeff Rasley and Yuxiong He , title =

  49. [51]

    Reddi and Zachary Charles and Manzil Zaheer and Zachary Garrett and Keith Rush and Jakub Kone

    Sashank J. Reddi and Zachary Charles and Manzil Zaheer and Zachary Garrett and Keith Rush and Jakub Kone. Adaptive Federated Optimization , booktitle =

  50. [52]

    CoRR , volume =

    Matthias Reisser and Christos Louizos and Efstratios Gavves and Max Welling , title =. CoRR , volume =

  51. [53]

    Scaling Vision with Sparse Mixture of Experts , booktitle =

    Carlos Riquelme and Joan Puigcerver and Basil Mustafa and Maxim Neumann and Rodolphe Jenatton and Andr. Scaling Vision with Sparse Mixture of Experts , booktitle =

  52. [54]

    NeurIPS , pages =

    Stephen Roller and Sainbayar Sukhbaatar and Arthur Szlam and Jason Weston , title =. NeurIPS , pages =

  53. [55]

    Advances in Neural Information Processing Systems , volume =

    Towards Crowdsourced Training of Large Neural Networks Using Decentralized Mixture-of-Experts , author =. Advances in Neural Information Processing Systems , volume =. 2020 , doi =

  54. [56]

    Proceedings of the 40th International Conference on Machine Learning , series =

    Max Ryabinin and Tim Dettmers and Michael Diskin and Alexander Borzunov , title =. Proceedings of the 40th International Conference on Machine Learning , series =. 2023 , url =

  55. [57]

    Communication Efficient

    Amir Sarfi and Benjamin Th. Communication Efficient. CoRR , volume =. 2025 , doi =

  56. [58]

    Lane , booktitle =

    Lorenzo Sani and Alex Iacob and Zeyu Cao and Royson Lee and Bill Marino and Yan Gao and Wanru Zhao and Dongqi Cai and Zexi Li and Xinchi Qiu and Nicholas D. Lane , booktitle =. Photon: Federated. 2025 , url =

  57. [59]

    International Workshop on Federated Foundation Models in Conjunction with NeurIPS 2024 , year =

    The Future of Large Language Model Pre-training is Federated , author =. International Workshop on Federated Foundation Models in Conjunction with NeurIPS 2024 , year =

  58. [60]

    Le and Geoffrey E

    Noam Shazeer and Azalia Mirhoseini and Krzysztof Maziarz and Andy Davis and Quoc V. Le and Geoffrey E. Hinton and Jeff Dean , title =

  59. [61]

    CoRR , volume =

    Mohammad Shoeybi and Mostofa Patwary and Raul Puri and Patrick LeGresley and Jared Casper and Bryan Catanzaro , title =. CoRR , volume =

  60. [62]

    Proceedings of the 15th ACM International Conference on Future and Sustainable Energy Systems , pages =

    Advancing accuracy in energy forecasting using mixture-of-experts and federated learning , author =. Proceedings of the 15th ACM International Conference on Future and Sustainable Energy Systems , pages =

  61. [63]

    Stich , title =

    Sebastian U. Stich , title =

  62. [64]

    ICLR 2025 Workshop on Modularity for Collaborative, Decentralized, and Continual Deep Learning , year =

    Revisiting Sparse Mixture of Experts for Resource-adaptive Federated Fine-tuning Foundation Models , author =. ICLR 2025 Workshop on Modularity for Collaborative, Decentralized, and Continual Deep Learning , year =

  63. [65]

    Gomez and Lukasz Kaiser and Illia Polosukhin , title =

    Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin , title =

  64. [66]

    Patterson , title =

    Samuel Williams and Andrew Waterman and David A. Patterson , title =. Commun

  65. [67]

    Proceedings of the Computer Vision and Pattern Recognition Conference , pages =

    dFLMoE: Decentralized Federated Learning via Mixture of Experts for Medical Data Analysis , author =. Proceedings of the Computer Vision and Pattern Recognition Conference , pages =

  66. [68]

    Hu , title =

    Greg Yang and Edward J. Hu , title =

  67. [69]

    Specialized Federated Learning Using a Mixture of Experts , journal =

    Edvin Listo Zec and Olof Mogren and John Martinsson and Leon Ren. Specialized Federated Learning Using a Mixture of Experts , journal =. 2020 , doi =

  68. [70]

    Mingshu Zhai and Jiaao He and Zixuan Ma and Zan Zong and Runqing Zhang and Jidong Zhai , title =

  69. [71]

    Zhao and Andrew M

    Yanqi Zhou and Tao Lei and Hanxiao Liu and Nan Du and Yanping Huang and Vincent Y. Zhao and Andrew M. Dai and Zhifeng Chen and Quoc V. Le and James Laudon , title =. NeurIPS , year =

  70. [72]

    Padmanabhan , title =

    Palak and Tella Rajashekhar Reddy and Bhaskar Kataria and Rohan Gandhi and Karan Tandon and Debopam Bhattacherjee and Venkata N. Padmanabhan , title =. CoRR , volume =. 2024 , doi =

  71. [73]

    CoRR , volume =

    Gheorghe Comanici and Eric Bieber and Mike Schaekermann and Ice Pasupat and Noveen Sachdeva and others , title =. CoRR , volume =. 2025 , doi =

  72. [74]

    Jeonghoon Kim and Byeongchan Lee and Cheonbok Park and Yeontaek Oh and Beomjun Kim and Taehwan Yoo and Seongjin Shin and Dongyoon Han and Jinwoo Shin and Kang Min Yoo , title =

  73. [75]

    2025 , url=

    Could decentralized training solve AI’s power problem? , author=. 2025 , url=

  74. [76]

    Inside the world’s most powerful AI Datacenter , url=

    Guthrie, Scott , year=. Inside the world’s most powerful AI Datacenter , url=. Official Microsoft Blog , publisher=

  75. [77]

    Jianlin Su and Murtadha H. M. Ahmed and Yu Lu and Shengfeng Pan and Wen Bo and Yunfeng Liu , title =. Neurocomputing , volume =

  76. [78]

    2025 , journal=

    Kimi K2: Open Agentic Intelligence , author=. 2025 , journal=

  77. [79]

    CoRR , volume =

    OpenAI , title =. CoRR , volume =

  78. [80]

    2024 , url =

    Ben Allal, Loubna and Lozhkov, Anton and Penedo, Guilherme and Wolf, Thomas and von Werra, Leandro , title =. 2024 , url =

  79. [81]

    SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention , booktitle =

    R. SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention , booktitle =

  80. [82]

    Liu , title =

    Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu , title =. Journal of Machine Learning Research , volume =. 2020 , url =

Showing first 80 references.