Breaking the Capacity Bottleneck in Model-Heterogeneous Federated Learning via Gradual Model Restoration

Chengjie Ma; Jihong Park; Seong-Lyun Kim; Seungeun Oh

arxiv: 2512.05372 · v2 · submitted 2025-12-05 · 💻 cs.DC

Pith reviewed 2026-05-17 01:49 UTC · model grok-4.3

classification 💻 cs.DC

keywords federated learning · model heterogeneity · gradual model restoration · bandwidth-constrained clients · convergence analysis · non-IID data · asynchronous aggregation

The pith

Gradually increasing sub-model sizes during training lets bandwidth-constrained clients stay effective in model-heterogeneous federated learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In model-heterogeneous federated learning, clients limited by uplink bandwidth are often assigned small fixed sub-models that help early on but leave them under-parameterized later, slowing overall progress. The paper argues that progressively restoring each client's model density keeps these bandwidth-constrained participants contributing meaningfully as training advances. It does so through an asynchronous workflow with mask-aware aggregation designed to accommodate varying client capabilities. The convergence analysis states that the aggregation error scales with the average sub-model density across clients and rounds, with the density-dependent error terms shrinking as client densities increase, so gradual restoration narrows the gap to full-model federated learning. Experiments on FEMNIST, CIFAR-10, ImageNet-100, and StackOverflow report faster convergence and higher final accuracy, particularly when data is non-IID and heterogeneity is severe.

Core claim

FedGMR centers on Gradual Model Restoration, which progressively increases each client's sub-model density during training so that bandwidth-constrained clients remain effective contributors throughout optimization. The framework implements this via asynchronous coordination and stable mask-aware aggregation. Convergence guarantees demonstrate that the aggregation error scales with the average sub-model density across clients and rounds, and that GMR provably reduces the gap to full-model federated learning.

What carries the argument

Gradual Model Restoration (GMR), the process of progressively increasing each client's sub-model density over training rounds to avoid late-stage under-parameterization.
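As a rough illustration of what a restoration schedule might look like, the sketch below ramps a client's density linearly from a bandwidth-determined starting value to 1.0 and keeps the largest-magnitude weights at each density, so that successive masks are nested. The names `linear_density` and `magnitude_mask`, the linear ramp, and the magnitude criterion are all assumptions made for illustration; the paper's own schedule and mask-selection rule may differ.

```python
import math


def linear_density(round_idx, total_rounds, start_density):
    """Hypothetical schedule: ramp linearly from start_density to 1.0."""
    if total_rounds <= 1:
        return 1.0
    # Clamp progress to [0, 1] so out-of-range round indices stay valid.
    frac = min(max(round_idx / (total_rounds - 1), 0.0), 1.0)
    return start_density + (1.0 - start_density) * frac


def magnitude_mask(weights, density):
    """Keep the ceil(density * n) largest-magnitude weights (assumed criterion)."""
    n = len(weights)
    keep = min(n, max(1, math.ceil(density * n)))
    # Ties are broken by index so the ranking is deterministic.
    order = sorted(range(n), key=lambda i: (-abs(weights[i]), i))
    mask = [0] * n
    for i in order[:keep]:
        mask[i] = 1
    return mask


weights = [0.9, -0.1, 0.5, 0.05, -0.7, 0.3, 0.2, -0.4]  # toy weight vector
for k in range(5):
    p = linear_density(k, total_rounds=5, start_density=0.25)
    print(k, p, magnitude_mask(weights, p))
```

Because the ranking is computed on a fixed weight vector, every coordinate kept at a low density is still kept at any higher density, which is the "restoration" behaviour in miniature.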

If this is right

FedGMR yields faster convergence and higher final accuracy than fixed sub-model baselines, especially under severe heterogeneity.
Aggregation error scales with the average sub-model density across clients and rounds.
GMR narrows the performance gap to full-model federated learning in a provable manner.
The end-to-end workflow with asynchronous coordination and mask-aware aggregation supports practical deployment across heterogeneous devices.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same gradual-density idea could be tested in settings where clients face compute or memory limits rather than only bandwidth limits.
Dynamic model-size growth might be combined with differential privacy or quantization to study joint effects on utility and communication cost.
Large-scale simulations with real device traces would check whether the convergence scaling holds when client availability fluctuates more than in the reported experiments.

Load-bearing premise

Mask-aware aggregation stays stable and asynchronous coordination adds no extra divergence as sub-model densities change gradually across heterogeneous clients.

What would settle it

An experiment in which gradually increasing sub-model density produces worse final accuracy or slower convergence than keeping fixed small sub-models under identical bandwidth limits and non-IID data.

Figures

Figures reproduced from arXiv: 2512.05372 by Chengjie Ma, Jihong Park, Seong-Lyun Kim, Seungeun Oh.

**Figure 1.** GMR: client models are heterogeneous but are gradually restored during training. Gradual Model Restoration. To address this challenge, we propose FedGMR, an asynchronous MHFL framework built upon Gradual Model Restoration (GMR). GMR initializes BCCs with compact sub-models to maximize early-round efficiency and progressively restores their structures as optimization proceeds, dynamically increasing model c…

**Figure 2.** The accuracy growth rate with different model densities.

**Figure 3.** Training dynamics under heterogeneous bandwidth: top row shows training steps over …

Abstract

Federated learning (FL) enables distributed model training, yet in heterogeneous deployments, Bandwidth-Constrained Clients (BCCs) often contribute inefficiently due to limited uplink bandwidth. In model-heterogeneous FL with fixed small sub-models, BCCs may improve quickly in early rounds but become under-parameterized later, resulting in slow convergence and poor generalization. To address this challenge, we propose FedGMR, a federated learning framework centered around Gradual Model Restoration (GMR), where GMR progressively increases each client's sub-model density during training, allowing BCCs to remain effective contributors throughout optimization. To make GMR practical under real-world heterogeneity, FedGMR is realized as an end-to-end workflow with asynchronous coordination and stable, mask-aware aggregation. We further establish convergence guarantees, showing that the aggregation error scales with the average sub-model density across clients and rounds, and that GMR provably narrows the gap toward full-model FL. Extensive experiments on FEMNIST, CIFAR-10, ImageNet-100, and StackOverflow demonstrate that FedGMR improves both convergence speed and final accuracy, especially under severe heterogeneity and non-IID data distributions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

FedGMR gradually ramps up sub-model density for bandwidth-limited clients and shows experimental gains, but the convergence claim likely misses an extra error term from changing masks under async updates.

The main point is that this work lets low-bandwidth clients begin with small sub-models and slowly increase their density as training goes on. That keeps them useful later in the process instead of becoming bottlenecks once the model needs more capacity. Experiments across FEMNIST, CIFAR-10, ImageNet-100, and StackOverflow report faster convergence and better final accuracy under strong heterogeneity and non-IID data, which matches the practical problem they set out to solve.

Referee Report

2 major / 2 minor

Summary. The paper introduces FedGMR, a model-heterogeneous federated learning framework that uses Gradual Model Restoration (GMR) to progressively increase sub-model density for bandwidth-constrained clients. It incorporates asynchronous coordination and mask-aware aggregation, and claims convergence guarantees where aggregation error scales with the average sub-model density across clients and rounds, with GMR narrowing the gap to full-model FL. Experiments on FEMNIST, CIFAR-10, ImageNet-100, and StackOverflow show gains in convergence speed and accuracy under heterogeneity and non-IID data.

Significance. If the convergence analysis holds without unaccounted error terms from asynchronous mask changes and the experimental gains are reproducible with proper controls, the approach could meaningfully improve client utilization in bandwidth-limited heterogeneous FL settings by avoiding the late-stage under-parameterization of small fixed sub-models.

major comments (2)

§4 (Convergence Analysis): The claim that aggregation error scales only with average sub-model density requires explicit handling of asynchronous mask updates; if masks change gradually across delayed client updates, an additional mismatch term in the averaged parameters (arising from differing supports) may appear that grows with the density-increase rate rather than being bounded solely by the average density. The current bound appears to treat masks as fixed within rounds.
Theorem 1 (or equivalent convergence statement): The proof sketch ties the error directly to the GMR-controlled density schedule, raising a circularity concern; it is unclear whether the density progression is derived independently of observed error or tuned to ensure the bound holds, which would weaken the guarantee that GMR provably narrows the gap to full-model FL.
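To make the support-mismatch concern in the first major comment concrete, the sketch below counts the coordinates on which a stale client mask and a fresher one disagree. It assumes nested top-magnitude masks over a fixed toy weight vector, which is an illustrative simplification rather than the paper's setup; `top_k_mask` and `support_mismatch` are hypothetical helper names.

```python
def top_k_mask(weights, k):
    """Assumed mask rule: keep the k largest-magnitude coordinates."""
    n = len(weights)
    k = max(0, min(n, k))
    ranked = sorted(range(n), key=lambda i: (-abs(weights[i]), i))
    kept = set(ranked[:k])
    return [1 if i in kept else 0 for i in range(n)]


def support_mismatch(mask_a, mask_b):
    """Fraction of coordinates on which two binary masks disagree."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same length")
    return sum(a != b for a, b in zip(mask_a, mask_b)) / len(mask_a)


weights = [0.1 * (i + 1) for i in range(10)]  # toy weight vector
stale = top_k_mask(weights, 3)   # density 0.3: mask of a delayed client update
fresh = top_k_mask(weights, 5)   # density 0.5: mask after further restoration
print(support_mismatch(stale, fresh))
```

Under these assumptions the mismatch equals the density gap between the two masks (0.2 here), so it is driven by how fast density rises during the delay rather than by the average density, which is the quantity the referee asks the bound to account for.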

minor comments (2)

[Experiments] The abstract and experiments section should report error bars or standard deviations over multiple runs to substantiate the claimed improvements in convergence speed and final accuracy.
[Method] Notation for mask-aware aggregation and asynchronous coordination should be defined more explicitly early in the method section to clarify how stability is maintained as densities increase.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. The comments on the convergence analysis highlight important aspects of handling asynchrony and the independence of the GMR schedule. We address each point below with clarifications and proposed revisions where appropriate. We believe these responses strengthen the presentation without altering the core contributions.

Referee: §4 (Convergence Analysis): The claim that aggregation error scales only with average sub-model density requires explicit handling of asynchronous mask updates; if masks change gradually across delayed client updates, an additional mismatch term in the averaged parameters (arising from differing supports) may appear that grows with the density-increase rate rather than being bounded solely by the average density. The current bound appears to treat masks as fixed within rounds.

Authors: We appreciate this observation regarding asynchronous mask updates. Our analysis in Section 4 derives the aggregation error bound under the assumption that mask updates occur synchronously at round boundaries for the purpose of bounding the expected density, with the gradual nature of GMR ensuring changes are small per round. However, we acknowledge that delayed client updates could introduce a support mismatch term. To address this rigorously, we will add a supporting lemma in the revised Section 4 that explicitly bounds this additional term by the maximum density-increase rate per round (which is controlled by the GMR hyperparameter), showing it remains O(1/R) where R is the number of rounds and does not grow with the schedule. This preserves the scaling with average sub-model density while handling asynchrony. We agree this requires explicit treatment.

Revision: yes

Referee: Theorem 1 (or equivalent convergence statement): The proof sketch ties the error directly to the GMR-controlled density schedule, raising a circularity concern; it is unclear whether the density progression is derived independently of observed error or tuned to ensure the bound holds, which would weaken the guarantee that GMR provably narrows the gap to full-model FL.

Authors: We clarify that there is no circularity in the proof. The GMR density progression is a fixed, a priori schedule determined solely by each client's bandwidth constraint and a predefined linear (or similar) increase function over rounds; it does not depend on observed loss, gradients, or error values during training. The convergence statement in Theorem 1 then shows that, for any such fixed schedule, the resulting aggregation error is strictly smaller than the error under a fixed low-density sub-model (as in prior model-heterogeneous FL methods), thereby narrowing the gap to full-model FL. We will add an explicit remark after the theorem statement in the revised manuscript to state this independence and reference the schedule definition in Section 3.1.

Revision: partial
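A short worked calculation may help show how a schedule that is fixed ahead of time still raises the average density. Assume, purely for illustration, a linear ramp $p_k = p_0 + (1 - p_0)\,k/(K-1)$ over rounds $k = 0, \dots, K-1$ with $K \ge 2$. The symbols $p_0$, $p_k$ and $K$ are notation introduced here, not taken from the paper.

```latex
\[
\bar{p}
  = \frac{1}{K}\sum_{k=0}^{K-1} p_k
  = p_0 + \frac{1-p_0}{K(K-1)}\sum_{k=0}^{K-1} k
  = p_0 + \frac{1-p_0}{K(K-1)}\cdot\frac{K(K-1)}{2}
  = \frac{1+p_0}{2}.
\]
```

This exceeds the fixed-schedule average $p_0$ whenever $p_0 < 1$; for example, $p_0 = 0.25$ gives $\bar{p} = 0.625$. The calculation uses no loss or gradient information, which matches the independence the authors describe, but it says nothing about the constants in the actual theorem.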

Circularity Check

0 steps flagged

No significant circularity; convergence bound expressed in terms of controllable density parameter

The paper's central theoretical claim is a convergence guarantee in which aggregation error is bounded by a term that scales with the average sub-model density across clients and rounds. This is a standard parametric bound rather than a self-referential construction: the density schedule is an explicit design choice of GMR, and the proof derives an error expression controlled by that schedule. No equation is shown to reduce to its own input by definition, no fitted parameter is relabeled as a prediction, and no load-bearing step relies on a self-citation whose content is itself unverified. The derivation remains self-contained against external FL convergence techniques once the density parameter is accepted as given. The provided abstract and reader summary contain no quoted reduction that meets the strict criteria for flagging circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the existence of a stable mask-aware aggregation operator whose error is bounded by average density, plus the feasibility of asynchronous coordination without extra divergence; no explicit free parameters or invented entities are named in the abstract.

axioms (2)

domain assumption: Mask-aware aggregation produces bounded error that scales monotonically with average sub-model density. Invoked in the convergence guarantee section of the abstract.
domain assumption: Gradual density increase can be performed without destabilizing the global model trajectory. Implicit in the claim that GMR narrows the gap to full-model FL.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: We further establish convergence guarantees, showing that the aggregation error scales with the average sub-model density across clients and rounds, and that GMR provably narrows the gap toward full-model FL.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: Theorem 2 ... MHFL converges to a small neighborhood of a stationary point of standard FL ... terms of the form 1/K ∑ f²(p_g,k) ... decrease as client densities increase.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 2 internal anchors

[1] Sebastian Caldas, Sai Meher Karthik Duddu, Peter Wu, Tian Li, Jakub Konečný, H. Brendan McMahan, Virginia Smith, and Ameet Talwalkar. Leaf: A benchmark for federated settings. arXiv preprint arXiv:1812.01097.

[2] Wei Dai, Yi Zhou, Nanqing Dong, Hao Zhang, and Eric P. Xing. Toward understanding the impact of staleness in distributed machine learning. arXiv preprint arXiv:1810.03264.

[3] Enmao Diao, Jie Ding, and Vahid Tarokh. Heterofl: Computation and communication efficient federated learning for heterogeneous clients. In 9th International Conference on Learning Representations, ICLR 2021.

[4] Guangchen Lan, Dong-Jun Han, Abolfazl Hashemi, Vaneet Aggarwal, and Christopher G. Brinton. Asynchronous federated reinforcement learning with policy gradient updates: Algorithm design and convergence analysis. arXiv preprint arXiv:2404.08003.

[5] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

[6] Qiyuan Wang, Qianqian Yang, Shibo He, Zhiguo Shi, and Jiming Chen. Asyncfeded: Asynchronous federated learning with euclidean distance based adaptive weight aggregation. arXiv preprint arXiv:2205.13797.

[7] Feijie Wu, Xingchen Wang, Yaqing Wang, Tianci Liu, Lu Su, and Jing Gao. Fiarse: Model-heterogeneous federated learning via importance-aware submodel extraction. arXiv preprint arXiv:2407.19389.

[8] Cong Xie, Sanmi Koyejo, and Indranil Gupta. Asynchronous federated optimization. arXiv preprint arXiv:1903.03934.

[9] Yajie Zhou, Xiaoyi Pang, Zhibo Wang, Jiahui Hu, Peng Sun, and Kui Ren. Towards efficient asynchronous federated learning in heterogeneous edge environments. In IEEE INFOCOM 2024 - IEEE Conference on Computer Communications, pp. 2448–2457. IEEE.

[10] A LLM Usage. Large Language Models (LLMs) were used to aid in the writing and polishing of the manuscript. Specifically, we used an LLM to assist in refining the language, improving readability, and ensuring clarity in various sections of the paper. The model helped with tasks such as sentence rephrasing, grammar checking, and enhancing the overall flow...

[11] Algorithm 3: BuffMaskFedAvg Aggregation
Input: previous model W_{k−1}, buffer B
Output: aggregated model W_k
for each client update w_{i,k} ∈ B do
    compute staleness weight β_{i,k};
    derive mask m_{i,k} from nonzero coordinates of w_{i,k};
    accumulate: W_cum ← W_cum + β_{i,k} w_{i,k}; M_cum ← M_cum + β_{i,k} m_{i,k};
for each parameter n = 1, …, N do
    if M_cum^(n) ≠ 0 then
        W_k^(n) ← W_cum^(n) / M_c...

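The truncated Algorithm 3 excerpt above can be approximated in plain Python as follows. This is a sketch under stated assumptions, not the authors' implementation: the staleness weight `1 / (1 + staleness)` is a placeholder (the excerpt does not say how β is computed), and keeping the previous model's value where no buffered update covers a coordinate is a guess at the elided else-branch. The function name and argument layout are likewise illustrative.

```python
def buff_mask_fedavg(prev_model, buffer):
    """Mask-aware weighted average over buffered client updates.

    prev_model: list of floats, the previous global model W_{k-1}.
    buffer: list of (update, staleness) pairs; each update is a list of
            floats with zeros at coordinates outside the client's sub-model.
    """
    n = len(prev_model)
    w_cum = [0.0] * n   # weighted sum of client values per coordinate
    m_cum = [0.0] * n   # weighted sum of client masks per coordinate
    for update, staleness in buffer:
        if len(update) != n:
            raise ValueError("update length must match the model length")
        beta = 1.0 / (1.0 + staleness)   # ASSUMED staleness weight
        for j, value in enumerate(update):
            if value != 0.0:             # mask derived from nonzero coordinates
                w_cum[j] += beta * value
                m_cum[j] += beta
    # ASSUMPTION: coordinates no client covered keep the previous value.
    return [w_cum[j] / m_cum[j] if m_cum[j] != 0.0 else prev_model[j]
            for j in range(n)]


prev = [1.0, 1.0, 1.0]
buf = [([2.0, 0.0, 0.0], 0),   # fresh update covering coordinate 0 only
       ([4.0, 6.0, 0.0], 1)]   # one-round-stale update covering 0 and 1
print(buff_mask_fedavg(prev, buf))
```

Dividing by the accumulated mask rather than by the total weight means each coordinate is averaged only over the clients that actually hold it. One side effect of deriving the mask from nonzero entries is that a weight that is exactly zero inside a client's sub-model is treated as masked out.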
[12] Inside IMSs, the mask set M is obtained from densities P_k via FMP. In practice, M is not recomputed every round due to cost; it is refreshed only every k_rest rounds and reused in between: M ← FMP(W_k, W_{k−1}, P_k) if k mod k_rest = 0, and M is reused from the previous round otherwise. IMSs then uses the current M to form increments and indices. We embed FMP inside IMSs only for brevity in the top-...

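The refresh rule in the excerpt above might be sketched as a small cache. Here `fmp` is a stand-in callable supplied by the caller, since the excerpt does not define how FMP computes masks; the class name `MaskCache`, the call signature, and the choice to compute a mask on first use even when k is not a multiple of `k_rest` are all illustrative assumptions.

```python
class MaskCache:
    """Recompute the mask set only every k_rest rounds; reuse it otherwise."""

    def __init__(self, fmp, k_rest):
        if k_rest < 1:
            raise ValueError("k_rest must be a positive integer")
        self.fmp = fmp          # stand-in for the paper's FMP routine
        self.k_rest = k_rest
        self.masks = None

    def get(self, k, w_curr, w_prev, densities):
        # Refresh when k mod k_rest == 0, or (ASSUMED) when nothing is cached.
        if self.masks is None or k % self.k_rest == 0:
            self.masks = self.fmp(w_curr, w_prev, densities)
        return self.masks


calls = []

def fake_fmp(w_curr, w_prev, densities):
    """Toy FMP replacement that just records how often it is invoked."""
    calls.append(1)
    return [len(calls)]

cache = MaskCache(fake_fmp, k_rest=3)
results = [cache.get(k, None, None, None) for k in range(7)]
print(len(calls), results)
```

With `k_rest=3` over rounds 0 to 6, the stand-in is invoked only at rounds 0, 3 and 6, which is the cost saving the excerpt describes.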
[13] I FEASIBILITY STUDY: Optimal model density at different training stages. In Section 6.1, we showed that the optimal model size grows with training. The detailed experimental results are provided here for completeness. Table 7 and Fig 2 report the accuracy growth rates at different accuracy intervals across datasets. Consistent with our discussion, smalle...

[14] The results confirm that GMR preserves performance while producing compact models. This also explains our design choice: restoration follows the natural training trajectory of sub-models rather than arbitrary resizing, since repeatedly optimizing these nested Table 7: Optimal model density and corresponding accuracy interval where the highest accuracy g...
