FedRef: Bayesian Fine-Tuning using a Reference Model to Mitigate Catastrophic Forgetting for Heterogeneous Federated Learning

Bongjun Choi; Taehwan Yoon; Wesley De Neve

arXiv: 2506.23210 · v5 · submitted 2025-06-29 · cs.LG · cs.AI · cs.DC

Pith reviewed 2026-05-19 07:58 UTC · model grok-4.3

classification: cs.LG · cs.AI · cs.DC

keywords: Federated Learning · Catastrophic Forgetting · Bayesian Fine-Tuning · Heterogeneous Data · Reference Model · MAP Regularization · Non-IID Settings

The pith

FedRef builds a reference model from past global models and applies server-side MAP regularization to curb forgetting in non-IID federated learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FedRef to handle data and system heterogeneity in federated learning, which typically leads to catastrophic forgetting and unstable updates. It creates a reference model by aggregating previous global models and adds a MAP-based regularization term that pulls new updates toward this reference. All regularization and fine-tuning happen on the server, so clients only perform standard local training. Tests on FEMNIST and CINIC-10 for classification plus FeTS2022 for medical segmentation show higher accuracy and quicker convergence while cutting client computation.

Core claim

FedRef integrates a MAP-based regularization term that calibrates global model updates toward a temporally aggregated reference model, thereby mitigating catastrophic forgetting and improving update stability. Unlike prior approaches, FedRef performs all fine-tuning operations on the server side, reducing client-side computational overhead while maintaining effective global optimization.

What carries the argument

A temporally aggregated reference model, built from past global models, paired with server-side MAP regularization that steers each new global update toward that reference.
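The mechanism can be illustrated with a toy sketch. This page does not give the paper's exact objective, so everything below is an assumption made for illustration: the reference is taken as a uniform mean over a sliding window of past global models, the client aggregate is a FedAvg-style weighted mean, and the MAP step uses a quadratic surrogate whose minimizer has a closed form. The names `build_reference`, `map_regularized_update`, `window`, and `lam` are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]  # a flattened parameter vector


def weighted_average(models: Sequence[Vector], weights: Sequence[float]) -> Vector:
    """Weighted mean of parameter vectors (FedAvg-style aggregation)."""
    total = float(sum(weights))
    dim = len(models[0])
    return [sum(w * m[i] for m, w in zip(models, weights)) / total for i in range(dim)]


def build_reference(history: Sequence[Vector], window: int) -> Vector:
    """ASSUMPTION: reference = uniform mean of the last `window` global models."""
    recent = list(history)[-window:]
    return weighted_average(recent, [1.0] * len(recent))


def map_regularized_update(aggregate: Vector, reference: Vector, lam: float) -> Vector:
    """Minimize 0.5*||theta - aggregate||^2 + 0.5*lam*||theta - reference||^2.

    ASSUMPTION: a quadratic surrogate (Gaussian likelihood around the client
    aggregate, Gaussian prior around the reference). Its minimizer is the
    precision-weighted mean (aggregate + lam * reference) / (1 + lam).
    """
    return [(a + lam * r) / (1.0 + lam) for a, r in zip(aggregate, reference)]


# Toy usage with made-up two-parameter models.
history = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]   # past global models
client_models = [[4.0, 0.0], [2.0, 8.0]]          # locally trained models
client_sizes = [1, 3]                             # local sample counts

aggregate = weighted_average(client_models, client_sizes)
reference = build_reference(history, window=2)
new_global = map_regularized_update(aggregate, reference, lam=1.0)
print(aggregate, reference, new_global)
```

With `lam = 0` the update reduces to the plain aggregate, and larger `lam` pulls the new global model closer to the reference. All of this runs on the server, which matches the design described above: clients only train locally and send their models.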

If this is right

Higher final accuracy on image classification under non-IID partitions such as FEMNIST and CINIC-10.
Faster convergence measured by rounds needed to reach target performance.
Lower client-side computation relative to methods that add client-side regularization.
Better segmentation results on medical imaging tasks like FeTS2022 without extra client burden.
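The "rounds needed to reach target performance" measure in the list above can be made concrete with a small helper. This is a minimal sketch under the assumption that one evaluation score is logged per communication round; the function name `rounds_to_target` and the example curves are hypothetical placeholders, not results from the paper.

```python
from typing import Optional, Sequence


def rounds_to_target(score_per_round: Sequence[float], target: float) -> Optional[int]:
    """Return the 1-based round at which the score first reaches `target`.

    Returns None if the target is never reached within the logged rounds.
    """
    for round_idx, score in enumerate(score_per_round, start=1):
        if score >= target:
            return round_idx
    return None


# Placeholder accuracy curves for two strategies (illustrative values only).
curve_a = [0.20, 0.50, 0.70, 0.80]
curve_b = [0.20, 0.40, 0.55, 0.65]

print(rounds_to_target(curve_a, 0.70))
print(rounds_to_target(curve_b, 0.70))
```

A strategy that never reaches the target returns `None` instead of a round count, so it cannot be mistaken for a slow but successful run when comparing strategies.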

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The server-centric design may allow deployment on edge devices with very limited local resources.
The reference-model idea could be adapted to continual learning settings where tasks arrive sequentially.
Aggregating historical global models might offer a lightweight way to stabilize training in other distributed optimization problems.

Load-bearing premise

That constructing a reference model from previous global models and applying MAP-based regularization on the server will reliably mitigate catastrophic forgetting and unbounded drift without introducing new biases or reducing overall model capacity in heterogeneous settings.

What would settle it

If FedRef, tested on a fresh heterogeneous non-IID dataset, shows no improvement in final accuracy or convergence speed over FedAvg, the core claim would be seriously undermined.

Figures

Figures reproduced from arXiv: 2506.23210 by Bongjun Choi, Taehwan Yoon, Wesley De Neve.

**Figure 2.** Data partitioning of the FeTS2022 dataset.

**Figure 3.** Server-side centralized loss for each strategy and dataset.

**Figure 4.** Server-side centralized Dice coefficient (DCE) and Hausdorff distance. Left: DCE; right: HF95.

Original abstract

Federated learning (FL) enables collaborative model training across distributed clients while preserving data privacy. However, data and system heterogeneity often cause catastrophic forgetting and unbounded drift in model updates, leading to degraded predictive performance and increased client-side computation. To address these challenges, we propose FedRef, a Bayesian fine-tuning method that leverages a reference model constructed from previous global models. FedRef integrates a MAP-based regularization term that calibrates global model updates toward a temporally aggregated reference model, thereby mitigating catastrophic forgetting and improving update stability. Unlike prior approaches, FedRef performs all fine-tuning operations on the server side, reducing client-side computational overhead while maintaining effective global optimization. Experiments on image classification (FEMNIST, CINIC-10) and medical image segmentation (FeTS2022) demonstrate that FedRef achieves superior predictive performance and faster convergence under heterogeneous, non-IID settings, while significantly lowering client-side computation compared with existing methods. These results highlight FedRef as an efficient and robust optimization framework for real-world FL scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes FedRef, a Bayesian fine-tuning method for heterogeneous federated learning. It constructs a reference model from temporally aggregated previous global models and applies a MAP-based regularization term on the server to calibrate global model updates, with the goal of mitigating catastrophic forgetting and unbounded drift while shifting computation away from clients. Experiments on FEMNIST and CINIC-10 for image classification and FeTS2022 for medical image segmentation are claimed to show superior predictive performance, faster convergence, and reduced client-side overhead under non-IID conditions compared to existing methods.

Significance. If the server-side reference-model regularization can be shown to bound client drift without introducing misalignment biases or capacity loss, the approach would offer a practical efficiency gain for real-world FL deployments. The shift of fine-tuning to the server is a clear strength for reducing client compute. However, the absence of explicit derivations, ablations on the reference construction, or statistical validation of the performance claims limits the assessed significance to moderate at present.

major comments (2)

[Abstract / Method] Abstract and method description: the central claim that a server-only MAP penalty to a temporally aggregated reference model bounds client-specific catastrophic forgetting rests on the assumption that a global historical average aligns sufficiently with divergent local distributions. No derivation or client-drift analysis is provided to show that this term dominates local update terms under high heterogeneity; the server never observes local data, so the regularization reduces to a global pull that may leave individual forgetting unaddressed.
[Experiments] Experiments section: the reported superior performance on FEMNIST, CINIC-10, and FeTS2022 lacks any description of the exact baselines, regularization hyper-parameters, statistical significance tests, or ablation studies isolating the reference-model component. Without these, the empirical support for faster convergence and lower client computation cannot be verified and does not yet substantiate the load-bearing claims.

minor comments (1)

[Method] Notation for the reference model aggregation and the precise form of the MAP term should be introduced with an equation early in the method section to allow readers to follow the Bayesian framing.
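As an illustration of the kind of equation this comment asks for, a generic MAP objective with a Gaussian prior centered on a reference model can be written as below. This is a textbook form, not the paper's own notation: the window length $S$, the weights $w_s$, the prior variance $\sigma^2$, and the coefficient $\lambda$ are all assumed symbols, and $\mathcal{L}(\theta) = -\log p(\mathcal{D}\mid\theta)$ stands for the data loss.

```latex
\begin{aligned}
\theta_{\mathrm{ref}}^{(t)} &= \sum_{s=1}^{S} w_s\,\theta^{(t-s)},
  \qquad \sum_{s=1}^{S} w_s = 1,\\
\theta^{(t)} &= \arg\max_{\theta}\;\bigl[\log p(\mathcal{D}\mid\theta) + \log p(\theta)\bigr],
  \qquad p(\theta) = \mathcal{N}\!\bigl(\theta_{\mathrm{ref}}^{(t)},\,\sigma^{2} I\bigr),\\
&= \arg\min_{\theta}\;\Bigl[\mathcal{L}(\theta)
  + \tfrac{\lambda}{2}\,\bigl\lVert \theta - \theta_{\mathrm{ref}}^{(t)} \bigr\rVert_2^{2}\Bigr],
  \qquad \lambda = \frac{1}{\sigma^{2}}.
\end{aligned}
```

The last line follows because the log of the Gaussian prior equals $-\tfrac{1}{2\sigma^{2}}\lVert\theta-\theta_{\mathrm{ref}}^{(t)}\rVert_2^{2}$ up to an additive constant, so a tighter prior (smaller $\sigma^2$) corresponds to a stronger pull toward the reference.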

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment below and indicate the revisions we will make to improve the manuscript.

Referee: [Abstract / Method] Abstract and method description: the central claim that a server-only MAP penalty to a temporally aggregated reference model bounds client-specific catastrophic forgetting rests on the assumption that a global historical average aligns sufficiently with divergent local distributions. No derivation or client-drift analysis is provided to show that this term dominates local update terms under high heterogeneity; the server never observes local data, so the regularization reduces to a global pull that may leave individual forgetting unaddressed.

Authors: We agree that an explicit derivation and client-drift analysis would strengthen the presentation. Although the server does not observe local data directly, the temporally aggregated reference model encodes historical consensus from prior rounds of client updates, providing a stable global anchor. In the revised manuscript we will add a derivation of the regularized server update and a brief analysis showing how the MAP term limits unbounded drift even under high heterogeneity. We will also note the inherent limitation that the regularization is global rather than client-specific and discuss its empirical effectiveness in mitigating forgetting across the federation.
Revision: yes

Referee: [Experiments] Experiments section: the reported superior performance on FEMNIST, CINIC-10, and FeTS2022 lacks any description of the exact baselines, regularization hyper-parameters, statistical significance tests, or ablation studies isolating the reference-model component. Without these, the empirical support for faster convergence and lower client computation cannot be verified and does not yet substantiate the load-bearing claims.

Authors: We concur that additional experimental details are required for reproducibility and verification. In the revised version we will expand the experiments section to specify: (i) exact implementations and hyper-parameter settings of all baselines, (ii) the regularization coefficients and aggregation window sizes used for the reference model, (iii) statistical significance testing (e.g., paired t-tests over multiple random seeds), and (iv) ablation studies that isolate the contribution of the reference-model construction versus the MAP term. These additions will directly address the referee’s concerns about substantiating the performance and efficiency claims.
Revision: yes
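The paired t-test over random seeds mentioned in item (iii) of the response could be computed along the following lines. This is a minimal standard-library sketch, assuming both methods are run with the same seeds so that scores pair up one-to-one. The function name `paired_t_statistic` and the example scores are hypothetical placeholders, not results from the paper. The p-value is left to a t-distribution table or a statistics package.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Return (t statistic, degrees of freedom) for paired samples a and b.

    Each index is one random seed; a[i] and b[i] must come from the same seed.
    """
    if len(a) != len(b):
        raise ValueError("paired samples must have the same length")
    if len(a) < 2:
        raise ValueError("need at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0.0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Placeholder per-seed scores (illustrative values only).
proposed = [3.0, 5.0, 7.0, 9.0]
baseline = [1.0, 2.0, 3.0, 4.0]

t_stat, dof = paired_t_statistic(proposed, baseline)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof}")
```

Pairing by seed removes seed-to-seed variance that both methods share, which is why a paired test is usually more sensitive here than an unpaired comparison of two means.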

Circularity Check

0 steps flagged

No circularity detected; method relies on external experimental validation

The paper describes FedRef as a server-side Bayesian fine-tuning approach that constructs a reference model from prior global models and applies MAP regularization to mitigate forgetting in heterogeneous FL. No equations, derivations, or self-referential definitions appear in the provided abstract or method summary that reduce a claimed prediction or result to a fitted input or self-citation by construction. The central claims are supported by experiments on FEMNIST, CINIC-10, and FeTS2022 rather than internal redefinitions, rendering the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the reference model and MAP term are presented as core components but their precise definitions and assumptions are not detailed.
