FLEX-MoE: Federated Mixture-of-Experts with Load-balanced Expert Assignment for Edge Computing

Boyang Zhang; Jian Zhang; Mingxuan Sun; Shuai Zhang; Songyang Zhang; Xiangwei Zhou; Xiaobing Chen

arxiv: 2512.23070 · v2 · pith:YMXBHK37new · submitted 2025-12-28 · 💻 cs.LG

Pith reviewed 2026-05-21 15:31 UTC · model grok-4.3

classification 💻 cs.LG

keywords Federated learning · Mixture of experts · Edge computing · Load balancing · Expert assignment · Non-IID data · Resource constraints

The pith

FLEX-MoE assigns experts to edge clients via fitness scores and optimization to achieve both specialization and system-wide load balance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a federated mixture-of-experts model can be deployed on resource-limited clients by scoring how well each expert suits a client's local data from training feedback and then solving an optimization problem that assigns experts while keeping their usage balanced across the whole system. This matters because non-IID data normally causes some experts to be overused and others idle, which hurts accuracy when clients cannot store every expert. A sympathetic reader would care if the method works because it could let wireless and IoT networks run large conditional models without central data collection or expert overload. If the central claim holds, models would reach higher accuracy and more even expert utilization than greedy assignment schemes that ignore global balance.

Core claim

FLEX-MoE introduces client-expert fitness scores that quantify expert suitability for local datasets through training feedback, and employs an optimization-based algorithm to maximize client-expert specialization while enforcing balanced expert utilization system-wide.

What carries the argument

Client-expert fitness scores derived from local training feedback together with an optimization algorithm that assigns experts to maximize specialization subject to global load-balance constraints.
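
As a rough illustration of how such a score could be maintained, the sketch below follows the update quoted from the paper further down this page, Q_t(c,e) = (1 − β) Q_{t−1}(c,e) + β s_{c,e}, with the loss-based choice s = exp(−α_L ℓ). The function names, the default values of `alpha` and `beta`, and the loss trajectory are assumptions made for the example, not values taken from the paper.

```python
import math

def instant_score(loss: float, alpha: float = 1.0) -> float:
    """Map a per-expert local loss to a score in (0, 1]; lower loss gives a higher score."""
    return math.exp(-alpha * loss)

def update_fitness(q_prev: float, score: float, beta: float = 0.3) -> float:
    """Exponential moving average: Q_t = (1 - beta) * Q_{t-1} + beta * s_t."""
    return (1.0 - beta) * q_prev + beta * score

# Hypothetical loss trajectory for one (client, expert) pair over three rounds.
q = 0.0
for loss in [2.0, 1.0, 0.5]:
    q = update_fitness(q, instant_score(loss))
```

A smaller `beta` makes the score slower to react to a single noisy round, which is one plausible reason to smooth the feedback rather than use the latest loss directly.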

If this is right

Higher accuracy than greedy methods that focus only on personalization.
Consistently balanced expert utilization across heterogeneous edge clients.
Joint satisfaction of client capacity limits and system-wide load balance.
Reduced performance degradation from expert skew in non-IID federated settings.
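
The contrast between greedy and balance-aware assignment can be seen on a toy instance. The sketch below assumes the objective quoted from the paper later on this page (maximize the summed fitness of assigned client-expert pairs, with each client c holding k_c experts and each expert's data-weighted load kept within per-expert bounds) and solves it by exhaustive search. That search is only viable for tiny inputs and is not the paper's solver; the fitness matrix, data sizes, and bounds are invented for illustration.

```python
from itertools import combinations, product

def expert_loads(assign, data_sizes, num_experts):
    """Data-weighted load per expert: sum of |D_c| over clients holding that expert."""
    loads = [0] * num_experts
    for c, experts in enumerate(assign):
        for e in experts:
            loads[e] += data_sizes[c]
    return loads

def greedy_assign(Q, k):
    """Each client keeps its k[c] highest-scoring experts, ignoring global load."""
    picks = []
    for c, row in enumerate(Q):
        ranked = sorted(range(len(row)), key=lambda e: row[e], reverse=True)
        picks.append(tuple(sorted(ranked[:k[c]])))
    return picks

def balanced_assign(Q, k, data_sizes, lower, upper):
    """Exhaustive search for the best assignment that satisfies the load bounds."""
    num_experts = len(Q[0])
    options = [combinations(range(num_experts), k[c]) for c in range(len(Q))]
    best, best_val = None, float("-inf")
    for assign in product(*options):
        loads = expert_loads(assign, data_sizes, num_experts)
        if all(lower[e] <= loads[e] <= upper[e] for e in range(num_experts)):
            val = sum(Q[c][e] for c, chosen in enumerate(assign) for e in chosen)
            if val > best_val:
                best, best_val = assign, val
    return best, best_val

# Toy instance (invented): 3 clients, 3 experts, one expert slot per client.
Q = [[0.9, 0.5, 0.1],
     [0.8, 0.6, 0.2],
     [0.7, 0.3, 0.4]]
k = [1, 1, 1]
data_sizes = [10, 10, 10]

greedy = greedy_assign(Q, k)
balanced, value = balanced_assign(Q, k, data_sizes, lower=[10] * 3, upper=[10] * 3)
```

In this instance every client prefers expert 0, so the greedy loads are `[30, 0, 0]`, while the constrained search spreads the clients across all three experts at a modest cost in total fitness.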

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Even load distribution could cut total wireless communication volume by avoiding repeated activation of the same experts.
The scoring idea may transfer to other conditional computation architectures used in distributed training.
Dynamic re-scoring after client dropouts could keep balance stable in mobile IoT networks.

Load-bearing premise

Two things must hold: the optimization algorithm runs efficiently on resource-constrained clients with limited communication, and the fitness scores from early training reliably forecast long-term expert value without adding bias or instability.

What would settle it

The claim would be falsified by an experiment showing that the optimization step requires more communication rounds than the model updates themselves, or that expert loads remain skewed after many rounds.
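
One way to make the load-skew half of this test concrete is to track a skew statistic over per-expert loads each round. The sketch below uses two common choices, the max-over-mean ratio and the coefficient of variation. Both are assumptions here, since this page does not say which balance metric the paper reports, and `load_skew` is a hypothetical helper name.

```python
from statistics import mean, pstdev

def load_skew(loads):
    """Return two imbalance statistics for a list of per-expert loads.

    max_over_mean is 1.0 for perfectly even loads and grows with skew;
    cv (coefficient of variation) is 0.0 for perfectly even loads.
    """
    avg = mean(loads)
    if avg == 0:
        raise ValueError("no load recorded on any expert")
    return {"max_over_mean": max(loads) / avg, "cv": pstdev(loads) / avg}

even = load_skew([10, 10, 10])
skewed = load_skew([30, 0, 0])
```

Logging these values per round would show whether balance is reached and then holds, or drifts back toward skew as training continues.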

Figures

Figures reproduced from arXiv: 2512.23070 by Boyang Zhang, Jian Zhang, Mingxuan Sun, Shuai Zhang, Songyang Zhang, Xiangwei Zhou, Xiaobing Chen.

Original abstract

Mixture-of-Experts (MoE) models enable scalable neural networks through conditional computation, offering enhanced effectiveness and efficiency for next-generation wireless communications. However, deploying MoE with federated learning (FL) over wireless and IoT edge networks faces two critical challenges: 1) resource-constrained clients cannot store large AI models with full expert sets, and 2) non-IID data distributions cause severe expert load imbalance that degrades model performance. To this end, we propose FLEX-MoE, a federated MoE framework that jointly optimizes expert assignment and load balancing under limited client capacity. Specifically, our approach introduces client-expert fitness scores that quantify expert suitability for local datasets through training feedback, and employs an optimization-based algorithm to maximize client-expert specialization while enforcing balanced expert utilization system-wide. Unlike greedy methods that focus solely on personalization while ignoring load imbalance, FLEX-MoE addresses expert utilization skew, which is particularly severe in heterogeneous edge FL. Our experimental results demonstrate superior accuracy and consistently balanced expert utilization across diverse resource-constrained scenarios for edge computing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FLEX-MoE frames a joint fitness-plus-balance optimization for federated MoE on edge clients, but the abstract gives no evidence the solver stays cheap enough to run locally.

The main point is that the paper adds client-expert fitness scores from training feedback and then runs an optimization step to push both local specialization and system-wide load balance in a federated MoE setting. That combination is the concrete addition over plain greedy assignment or standard FL-MoE work. It directly targets the two edge constraints mentioned: clients cannot hold every expert, and non-IID data quickly skews which experts get used. The abstract shows they ran tests in resource-constrained scenarios and report gains in accuracy plus more even utilization, which is the part that could be useful to people building wireless or IoT deployments. The framing itself is clear and the motivation matches real deployment pain points.

The soft spot is exactly the one the stress-test note flags. There is no complexity bound, iteration count, or communication volume given for the assignment algorithm, so it is impossible to judge whether the optimization stays tractable on typical edge hardware. The abstract also skips any equations, baseline list, or error bars, which leaves the performance claims uncheckable from what is here. If the full paper supplies a lightweight solver or shows the overhead stays low, that would fix the gap; right now it is the load-bearing unverified piece.

This paper is for the federated-learning-for-edge crowd rather than the core MoE theory group. A reader working on heterogeneous IoT systems could extract the fitness-score idea and test it even if the full optimization turns out heavier than hoped.

I would send it to peer review. The problem is practical and the proposed direction is a reasonable next step, so referees can check the missing implementation details and run their own cost measurements.

Referee Report

2 major / 2 minor

Summary. The paper proposes FLEX-MoE, a federated Mixture-of-Experts (MoE) framework for resource-constrained edge computing. It introduces client-expert fitness scores derived from local training feedback to quantify suitability and employs an optimization-based assignment algorithm that jointly maximizes client-expert specialization while enforcing system-wide load balance. The approach targets non-IID data distributions that cause expert utilization skew, claiming superior accuracy and consistently balanced expert utilization over greedy personalization methods in heterogeneous wireless/IoT scenarios.

Significance. If the optimization procedure and fitness scores prove effective and efficient, the work could meaningfully advance federated deployment of scalable MoE models on edge networks by addressing both storage limits and load imbalance. The joint optimization of specialization and balance is a relevant distinction from prior FL-MoE efforts focused primarily on personalization.

major comments (2)

[Section describing the optimization-based algorithm] The central claim that the optimization-based assignment algorithm can be executed efficiently on resource-constrained clients rests on an unverified assumption. The manuscript provides no complexity bounds, approximation guarantees, runtime measurements, or decentralized implementation details for the constrained assignment problem, leaving open whether it remains tractable on IoT-scale hardware (see skeptic note on feasibility).
[Method description of client-expert fitness scores] The abstract states that fitness scores 'quantify expert suitability for local datasets through training feedback', which presumes they reliably predict long-term expert utility, yet no analysis of potential bias, instability, or long-term drift in these scores is presented. This is load-bearing for the specialization-maximization claim under non-IID conditions.
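
To give a sense of why the tractability question in the first major comment matters, the sketch below counts how many assignments exist before any load constraint is applied, assuming each client c independently picks k_c of E experts. This only bounds what a naive exhaustive search would face; the paper may well use a far more efficient solver, and the sizes used here are hypothetical.

```python
from math import comb, prod

def unconstrained_assignment_count(num_experts, capacities):
    """Number of ways for each client to pick capacities[c] experts out of num_experts."""
    return prod(comb(num_experts, k) for k in capacities)

# Hypothetical sizes: 8 experts, 10 clients, 2 expert slots per client.
count = unconstrained_assignment_count(8, [2] * 10)
```

With these sizes the count is 28 to the power of 10, so even a small federation is far beyond enumeration, which is why complexity bounds or runtime measurements for the actual solver would be informative.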

minor comments (2)

[Abstract] The abstract would benefit from a brief statement of the number of experts, clients, and datasets used in the reported experiments to allow immediate assessment of the scope of the 'diverse resource-constrained scenarios' claim.
[Method] Notation for the fitness scores and the objective function of the optimization algorithm should be introduced with explicit mathematical definitions rather than descriptive text alone.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the optimization algorithm's feasibility and the analysis of fitness scores. We address each major comment below.

Referee: [Section describing the optimization-based algorithm] The central claim that the optimization-based assignment algorithm can be executed efficiently on resource-constrained clients rests on an unverified assumption. The manuscript provides no complexity bounds, approximation guarantees, runtime measurements, or decentralized implementation details for the constrained assignment problem, leaving open whether it remains tractable on IoT-scale hardware (see skeptic note on feasibility).

Authors: We agree that the manuscript lacks explicit complexity analysis and empirical validation for the assignment algorithm on constrained hardware. In the revised version, we will add a dedicated subsection providing time and space complexity bounds for the optimization procedure, along with approximation guarantees for the load-balanced assignment problem. We will also include runtime measurements obtained from simulations on representative IoT-scale hardware profiles and clarify the decentralized execution steps that rely on local client computations with minimal server coordination. revision: yes
Referee: [Method description of client-expert fitness scores] The abstract states that fitness scores 'quantify expert suitability for local datasets through training feedback', which presumes they reliably predict long-term expert utility, yet no analysis of potential bias, instability, or long-term drift in these scores is presented. This is load-bearing for the specialization-maximization claim under non-IID conditions.

Authors: Fitness scores are derived directly from per-expert loss feedback during local training to capture suitability under each client's data distribution. While experiments demonstrate improved specialization and accuracy, we acknowledge the absence of dedicated analysis on score stability or bias. In the revision, we will incorporate an empirical study of score evolution across communication rounds, including metrics for variance and drift under varying non-IID degrees, and discuss mitigation strategies for potential biases. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper proposes FLEX-MoE as a new federated MoE framework that defines client-expert fitness scores from local training feedback and introduces an optimization-based assignment algorithm to jointly maximize specialization and enforce load balance. These elements are presented as direct responses to the stated challenges of resource constraints and non-IID data, without any reduction of the central claims to prior fitted quantities, self-citations, or ansatzes by construction. The derivation remains self-contained because the fitness scores and optimization procedure are explicitly introduced as novel mechanisms grounded in the edge FL problem setup, with no equations or steps shown to be equivalent to their own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that non-IID data distributions produce severe expert load imbalance that can be mitigated by fitness-based assignment; no free parameters or invented physical entities are described in the abstract.

axioms (1)

domain assumption: Non-IID data distributions across clients cause severe expert load imbalance that degrades model performance in federated MoE.
Stated as one of the two critical challenges the framework addresses.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: employs an optimization-based algorithm to maximize client-expert specialization while enforcing balanced expert utilization system-wide... Solve: $\max \sum Q_{t-1}(c,e)\, X_{c,e}$ subject to $\sum_e X_{c,e} = k_c$ and $L_{\mathrm{new}}(e) \le \sum_c |D_c|\, X_{c,e} \le \Gamma_{\mathrm{new}}(e)$

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear

Paper passage: client-expert fitness scores... $s^{(t)}_{c,e} = a^{(t)}_{c,e}$ or $\exp(-\alpha_L\, \ell^{(t)}_{c,e})$... $Q_t(c,e) = (1-\beta)\, Q_{t-1}(c,e) + \beta\, s^{(t)}_{c,e}$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

$\phi$-Balancing for Mixture-of-Experts Training
cs.LG · 2026-05 · unverdicted · novelty 7.0

φ-balancing is a convex optimization method for population-level expert balance in MoE training that derives an online EMA adjustment and outperforms heuristic baselines.

