FedRef: Bayesian Fine-Tuning using a Reference Model to Mitigate Catastrophic Forgetting for Heterogeneous Federated Learning
Pith reviewed 2026-05-19 07:58 UTC · model grok-4.3
The pith
FedRef builds a reference model from past global models and applies server-side MAP regularization to curb forgetting in non-IID federated learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FedRef integrates a MAP-based regularization term that calibrates global model updates toward a temporally aggregated reference model, thereby mitigating catastrophic forgetting and improving update stability. Unlike prior approaches, FedRef performs all fine-tuning operations on the server side, reducing client-side computational overhead while maintaining effective global optimization.
What carries the argument
A temporally aggregated reference model, built from past global models, paired with a server-side MAP regularization term that pulls each global model update toward that reference.
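A minimal sketch of this mechanism, under stated assumptions. The source gives neither the exact aggregation rule nor the form of the MAP term, so the sketch assumes that the reference model is a uniform average over a sliding window of past global models, and that the MAP term comes from an isotropic Gaussian prior centered on the reference, which reduces to a squared L2 penalty. The quadratic stand-in for the server loss, the function names, the window size, and `lam` are all hypothetical and are not taken from the paper.

```python
from collections import deque


def average(models):
    """Element-wise mean of equally sized parameter vectors."""
    n = len(models)
    return [sum(vals) / n for vals in zip(*models)]


def map_server_update(w_agg, w_ref, lam):
    """Closed-form minimiser of 0.5*||w - w_agg||^2 + 0.5*lam*||w - w_ref||^2.

    ASSUMPTION: the quadratic first term stands in for the real server loss,
    and the second term is the Gaussian-prior (MAP) penalty toward w_ref.
    """
    return [(a + lam * r) / (1.0 + lam) for a, r in zip(w_agg, w_ref)]


# Hypothetical sliding window holding the last three global models.
history = deque(maxlen=3)
history.extend([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]])

w_ref = average(history)   # reference model: mean of the window
w_agg = [5.0, 0.0]         # toy aggregate of this round's client updates
w_new = map_server_update(w_agg, w_ref, lam=1.0)

history.append(w_new)      # the oldest global model drops out of the window
```

With `lam=1.0` the new global model sits halfway between the round's aggregate and the reference. A larger `lam` gives a stronger pull toward past global models, and `lam=0` recovers the unregularized aggregate.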
If this is right
- Higher final accuracy on image classification under non-IID partitions of datasets such as FEMNIST and CINIC-10.
- Faster convergence measured by rounds needed to reach target performance.
- Lower client-side compute and communication costs relative to methods that add client-side regularization.
- Better segmentation results on medical imaging tasks like FeTS2022 without extra client burden.
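The convergence criterion in the second bullet, rounds needed to reach a target performance, can be made concrete with a small helper. This is a hedged sketch: the function name, the target value, and both accuracy curves are illustrative placeholders, not results from the paper.

```python
def rounds_to_target(accuracy_per_round, target):
    """Return the 1-based round at which accuracy first reaches target.

    Returns None if the target is never reached.
    """
    for round_idx, acc in enumerate(accuracy_per_round, start=1):
        if acc >= target:
            return round_idx
    return None


# Placeholder curves for two hypothetical strategies (NOT paper results).
curve_a = [0.40, 0.55, 0.68, 0.74, 0.79]
curve_b = [0.35, 0.48, 0.60, 0.69, 0.73]

rounds_a = rounds_to_target(curve_a, target=0.70)
rounds_b = rounds_to_target(curve_b, target=0.70)
```

Returning `None` rather than the last round keeps a run that never reaches the target from being counted as a slow success.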
Where Pith is reading between the lines
- The server-centric design may allow deployment on edge devices with very limited local resources.
- The reference-model idea could be adapted to continual learning settings where tasks arrive sequentially.
- Aggregating historical global models might offer a lightweight way to stabilize training in other distributed optimization problems.
Load-bearing premise
That constructing a reference model from previous global models and applying MAP-based regularization on the server will reliably mitigate catastrophic forgetting and unbounded drift without introducing new biases or reducing overall model capacity in heterogeneous settings.
What would settle it
If FedRef is tested on a fresh heterogeneous non-IID dataset and shows no improvement in final accuracy or convergence speed over FedAvg, the core claim would be falsified.
Original abstract
Federated learning (FL) enables collaborative model training across distributed clients while preserving data privacy. However, data and system heterogeneity often cause catastrophic forgetting and unbounded drift in model updates, leading to degraded predictive performance and increased client-side computation. To address these challenges, we propose FedRef, a Bayesian fine-tuning method that leverages a reference model constructed from previous global models. FedRef integrates a MAP-based regularization term that calibrates global model updates toward a temporally aggregated reference model, thereby mitigating catastrophic forgetting and improving update stability. Unlike prior approaches, FedRef performs all fine-tuning operations on the server side, reducing client-side computational overhead while maintaining effective global optimization. Experiments on image classification (FEMNIST, CINIC-10) and medical image segmentation (FeTS2022) demonstrate that FedRef achieves superior predictive performance and faster convergence under heterogeneous, non-IID settings, while significantly lowering client-side computation compared with existing methods. These results highlight FedRef as an efficient and robust optimization framework for real-world FL scenarios.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes FedRef, a Bayesian fine-tuning method for heterogeneous federated learning. It constructs a reference model from temporally aggregated previous global models and applies a MAP-based regularization term on the server to calibrate global model updates, with the goal of mitigating catastrophic forgetting and unbounded drift while shifting computation away from clients. Experiments on FEMNIST and CINIC-10 for image classification and FeTS2022 for medical image segmentation are claimed to show superior predictive performance, faster convergence, and reduced client-side overhead under non-IID conditions compared to existing methods.
Significance. If the server-side reference-model regularization can be shown to bound client drift without introducing misalignment biases or capacity loss, the approach would offer a practical efficiency gain for real-world FL deployments. The shift of fine-tuning to the server is a clear strength for reducing client compute. However, the absence of explicit derivations, ablations on the reference construction, or statistical validation of the performance claims limits the assessed significance to moderate at present.
major comments (2)
- [Abstract / Method] Abstract and method description: the central claim that a server-only MAP penalty to a temporally aggregated reference model bounds client-specific catastrophic forgetting rests on the assumption that a global historical average aligns sufficiently with divergent local distributions. No derivation or client-drift analysis is provided to show that this term dominates local update terms under high heterogeneity; the server never observes local data, so the regularization reduces to a global pull that may leave individual forgetting unaddressed.
- [Experiments] Experiments section: the reported superior performance on FEMNIST, CINIC-10, and FeTS2022 lacks any description of the exact baselines, regularization hyper-parameters, statistical significance tests, or ablation studies isolating the reference-model component. Without these, the empirical support for faster convergence and lower client computation cannot be verified and does not yet substantiate the load-bearing claims.
minor comments (1)
- [Method] Notation for the reference model aggregation and the precise form of the MAP term should be introduced with an equation early in the method section to allow readers to follow the Bayesian framing.
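For orientation, one generic form such an equation could take is sketched below. It is an assumption, not the paper's formulation: it supposes a uniform average over the last K global models and an isotropic Gaussian prior centered on the reference. The symbols \theta, K, \sigma, \lambda, and \mathcal{L} are introduced here for illustration only.

```latex
% ASSUMPTION: uniform window average and isotropic Gaussian prior.
\begin{align*}
\theta_{\mathrm{ref}}^{(t)} &= \frac{1}{K} \sum_{k=1}^{K} \theta^{(t-k)}, \\
p(\theta) &= \mathcal{N}\!\left(\theta;\ \theta_{\mathrm{ref}}^{(t)},\ \sigma^{2} I\right), \\
\theta^{(t)} &= \arg\max_{\theta}\ \bigl[\log p(\mathcal{D} \mid \theta) + \log p(\theta)\bigr] \\
             &= \arg\min_{\theta}\ \Bigl[\mathcal{L}(\theta) + \frac{\lambda}{2}\,\bigl\lVert \theta - \theta_{\mathrm{ref}}^{(t)} \bigr\rVert_{2}^{2}\Bigr],
\qquad \lambda = \frac{1}{\sigma^{2}},
\end{align*}
% where \mathcal{L}(\theta) = -\log p(\mathcal{D} \mid \theta) is the server-side loss.
```

Under these assumptions the Bayesian framing reduces to an L2 penalty toward the reference, with the prior variance setting the regularization strength.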
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments. We address each major comment below and indicate the revisions we will make to improve the manuscript.
- Referee: [Abstract / Method] Abstract and method description: the central claim that a server-only MAP penalty to a temporally aggregated reference model bounds client-specific catastrophic forgetting rests on the assumption that a global historical average aligns sufficiently with divergent local distributions. No derivation or client-drift analysis is provided to show that this term dominates local update terms under high heterogeneity; the server never observes local data, so the regularization reduces to a global pull that may leave individual forgetting unaddressed.
  Authors: We agree that an explicit derivation and client-drift analysis would strengthen the presentation. Although the server does not observe local data directly, the temporally aggregated reference model encodes historical consensus from prior rounds of client updates, providing a stable global anchor. In the revised manuscript we will add a derivation of the regularized server update and a brief analysis showing how the MAP term limits unbounded drift even under high heterogeneity. We will also note the inherent limitation that the regularization is global rather than client-specific and discuss its empirical effectiveness in mitigating forgetting across the federation.
  Revision: yes
- Referee: [Experiments] Experiments section: the reported superior performance on FEMNIST, CINIC-10, and FeTS2022 lacks any description of the exact baselines, regularization hyper-parameters, statistical significance tests, or ablation studies isolating the reference-model component. Without these, the empirical support for faster convergence and lower client computation cannot be verified and does not yet substantiate the load-bearing claims.
  Authors: We concur that additional experimental details are required for reproducibility and verification. In the revised version we will expand the experiments section to specify: (i) exact implementations and hyper-parameter settings of all baselines, (ii) the regularization coefficients and aggregation window sizes used for the reference model, (iii) statistical significance testing (e.g., paired t-tests over multiple random seeds), and (iv) ablation studies that isolate the contribution of the reference-model construction versus the MAP term. These additions will directly address the referee's concerns about substantiating the performance and efficiency claims.
  Revision: yes
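The paired t-test over seeds mentioned in item (iii) can be sketched as follows. This is a minimal standard-library illustration, assuming both methods are run with the same seeds so that scores pair up one to one. The function name and the toy scores are hypothetical and are not results from the paper.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for a paired t-test on per-seed scores."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least two seeds")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("differences are constant; t is undefined")
    n = len(diffs)
    t = statistics.mean(diffs) / (sd_diff / math.sqrt(n))
    return t, n - 1


# Toy per-seed scores for two hypothetical methods (NOT paper results).
method_a = [3.0, 5.0, 4.0, 6.0]
method_b = [1.0, 2.0, 3.0, 2.0]

t_stat, df = paired_t_statistic(method_a, method_b)
```

The sketch stops at the t statistic and degrees of freedom. Turning them into a p-value needs the Student t distribution, which a library routine such as `scipy.stats.ttest_rel` provides if SciPy is available.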
Circularity Check
No circularity detected; method relies on external experimental validation
full rationale
The paper describes FedRef as a server-side Bayesian fine-tuning approach that constructs a reference model from prior global models and applies MAP regularization to mitigate forgetting in heterogeneous FL. No equations, derivations, or self-referential definitions appear in the provided abstract or method summary that reduce a claimed prediction or result to a fitted input or self-citation by construction. The central claims are supported by experiments on FEMNIST, CINIC-10, and FeTS2022 rather than internal redefinitions, rendering the derivation chain self-contained against external benchmarks.