Privacy Preserving Machine Learning Workflow: from Anonymization to Personalized Differential Privacy Budgets in Federated Learning
Pith reviewed 2026-05-08 17:50 UTC · model grok-4.3
The pith
Personalized differential privacy budgets based on re-identification risk improve federated learning model performance over fixed budgets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that, within a federated learning network handling sensitive tabular records, allocating distinct differential privacy budgets to each client on the basis of a re-identification risk metric yields measurably better model performance than the conventional choice of a single global privacy budget applied uniformly across all clients.
What carries the argument
The client-specific privacy budget assignment procedure that converts each participant's re-identification risk score into an individual differential privacy parameter.
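As a minimal sketch of what such a conversion could look like, the Python snippet below maps a risk score in [0, 1] to a privacy parameter by linear interpolation between two bounds. The function name `risk_to_epsilon`, the linear form, the direction (higher risk gives a smaller epsilon), and the bounds `eps_min` and `eps_max` are all illustrative assumptions; this summary does not describe the paper's actual mapping.

```python
def risk_to_epsilon(risk: float, eps_min: float = 0.5, eps_max: float = 5.0) -> float:
    """Hypothetical mapping: higher re-identification risk -> smaller epsilon (more noise)."""
    if not 0.0 <= risk <= 1.0:
        raise ValueError("risk must lie in [0, 1]")
    if not 0.0 < eps_min <= eps_max:
        raise ValueError("need 0 < eps_min <= eps_max")
    # Linear interpolation: risk 0 -> eps_max, risk 1 -> eps_min.
    return eps_max - risk * (eps_max - eps_min)


# Illustrative client names and risk scores (not taken from the paper).
client_risks = {"client_a": 0.1, "client_b": 0.5, "client_c": 0.9}
budgets = {name: risk_to_epsilon(r) for name, r in client_risks.items()}
```

Any monotone decreasing function would serve the same purpose; the linear form is chosen here only because it is easy to inspect.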
If this is right
- The workflow integrates anonymization, drift detection, and personalized budgets into one end-to-end process for tabular sensitive data.
- Client drift detection supplies a concrete mechanism for spotting and limiting poisoning attempts during federation rounds.
- Personalized budgets allow the network to avoid the uniform performance penalty that a single strict privacy level imposes on every participant.
- The experimental comparison on medical records supplies direct evidence that utility can be improved while the overall privacy accounting remains within the differential privacy framework.
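The drift-detection idea in the list above can be illustrated with a simple distance-based check. This is a sketch under stated assumptions, not the paper's formal definition of client drift: it assumes each client sends a flat vector of update values, takes the coordinate-wise median as a robust center, and flags clients whose Euclidean distance from that center exceeds `factor` times the median distance. The function name and the threshold rule are hypothetical.

```python
import math
from statistics import median


def flag_drifting_clients(updates, factor=3.0):
    """Return the names of clients whose update lies unusually far from the median update.

    updates: dict mapping client name -> list of floats (all the same length).
    """
    if not updates:
        return set()
    names = list(updates)
    dim = len(updates[names[0]])
    # Coordinate-wise median is less sensitive to a few poisoned updates than the mean.
    center = [median(updates[n][i] for n in names) for i in range(dim)]
    dist = {n: math.dist(updates[n], center) for n in names}
    typical = median(dist.values())
    # Guard against a zero threshold when most updates coincide with the center.
    threshold = factor * max(typical, 1e-12)
    return {n for n in names if dist[n] > threshold}


# Illustrative updates: three similar clients and one outlier.
example_updates = {
    "a": [1.0, 1.0],
    "b": [1.1, 0.9],
    "c": [0.9, 1.1],
    "d": [10.0, -8.0],
}
flagged = flag_drifting_clients(example_updates)
```

In this toy example only client "d" is flagged. A real deployment would need to separate benign heterogeneity (non-IID data) from malicious drift, which a single distance threshold cannot do on its own.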
Where Pith is reading between the lines
- If the risk metric can be kept stable across training rounds, the same logic could support budgets that adapt as new data arrives at each client.
- The same personalization principle might apply outside medicine once comparable risk metrics are defined for other tabular domains such as finance or census data.
- Making the risk metric itself differentially private would remove a possible information leak while still allowing the performance gains to be tested.
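The last point above, releasing the risk metric under differential privacy, could be sketched with the standard Laplace mechanism. The snippet assumes the risk score lies in [0, 1] and that a bound on its sensitivity is known; deriving such a bound for a concrete risk metric is not trivial and is simply assumed here. The function name `privatize_risk` is hypothetical, and the sampler is for illustration only, since naive floating-point Laplace sampling is known to be unsuitable for production use.

```python
import random


def privatize_risk(risk, epsilon, sensitivity, rng=None):
    """Release a risk score with Laplace noise of scale sensitivity / epsilon."""
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponential draws with mean `scale`
    # follows a Laplace distribution with location 0 and that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    # Clamping to [0, 1] is post-processing, so it does not weaken the guarantee.
    return min(1.0, max(0.0, risk + noise))


# Illustrative call with an assumed sensitivity of 1.0.
noisy_risk = privatize_risk(0.3, epsilon=1.0, sensitivity=1.0, rng=random.Random(0))
```

The epsilon spent on this release would have to be counted in the client's overall privacy accounting, which is exactly the trade-off the bullet point raises.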
Load-bearing premise
A re-identification risk metric can be computed reliably from each client's own data and turned into a privacy budget without introducing bias, circular dependence on the training process, or fresh attack surfaces.
What would settle it
Re-running the medical-records experiments and finding that the personalized-budget models do not produce lower error than the fixed-budget models on both reported metrics.
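One way to make such a re-run decisive is a paired comparison across repeated runs. The sketch below implements an exact two-sided sign-flip permutation test on paired error differences using only the standard library. The function name is hypothetical, the choice of test is an assumption (the paper's evaluation protocol is not described in this summary), and the error values are placeholders for illustration, not results from the paper.

```python
from itertools import product


def paired_permutation_pvalue(errors_a, errors_b):
    """Exact two-sided sign-flip permutation test on paired differences."""
    if len(errors_a) != len(errors_b) or not errors_a:
        raise ValueError("need two non-empty sequences of equal length")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    # Under the null hypothesis each paired difference is equally likely
    # to have either sign, so enumerate every sign assignment.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed - 1e-12:
            hits += 1
    return hits / total


# Placeholder errors from six hypothetical paired runs (NOT from the paper).
personalized = [0.30, 0.28, 0.31, 0.29, 0.27, 0.30]
fixed_budget = [0.35, 0.33, 0.34, 0.36, 0.32, 0.35]
p_value = paired_permutation_pvalue(personalized, fixed_budget)
```

Full enumeration costs 2^n evaluations, so this form is practical only for a small number of paired runs; beyond that, random sampling of sign assignments is the usual substitute.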
Original abstract
The growing development of artificial intelligence based solutions, together with privacy legislation, has driven the rise of the so-called privacy preserving machine learning architectures, such as federated learning. While federated learning enables model training on decentralized data preventing their sharing and centralization, it still faces several challenges related to data integrity and privacy. This paper presents a comprehensive privacy preserving federated learning workflow for sensitive tabular data, including anonymization and differential privacy techniques. We also introduce a formal definition for the concept of client drift, together with ways of detecting it to mitigate poisoning attacks. Then, we detail a complete methodology for assigning personalized privacy budgets for global differential privacy to the different clients participating in the network, based on a re-identification risk metric. The proposed methodology is presented and tested on an openly available dataset of medical records. Within the experimental setup we show that the approach based on personalized budgets, compared to the architecture including global differential privacy with fixed privacy budget, achieves a better model performance in terms of two error metrics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a comprehensive privacy-preserving federated learning workflow for sensitive tabular data. It combines anonymization techniques with global differential privacy, introduces a formal definition of client drift along with detection methods to mitigate poisoning attacks, and details a methodology for assigning personalized differential privacy budgets to clients based on a per-client re-identification risk metric. The approach is evaluated on an openly available medical records dataset, with the claim that personalized budgets yield better model performance than a fixed-budget global DP baseline according to two error metrics.
Significance. If the re-identification risk metric is shown to be computed independently of the training data and model parameters without introducing bias or new attack surfaces, the personalized budget assignment could meaningfully improve the utility-privacy trade-off in heterogeneous federated settings such as healthcare. The client drift formalization and detection mechanism would add a useful robustness component against poisoning. However, the current experimental evidence is insufficient to establish these benefits.
major comments (3)
- [Abstract and Experimental section] The claim of superior performance on two error metrics versus fixed-budget global DP lacks any reported dataset size, exact metric definitions (e.g., MSE, MAE, or classification error), statistical significance tests, baseline implementation details, or description of how the re-identification risk metric is calculated and applied. Without these, the headline result cannot be verified or reproduced.
- [Section on personalized DP budget assignment] The re-identification risk metric used to derive per-client budgets must be demonstrated to be independent of the model parameters and training data itself. If the metric is derived from the same client data used for model training, the personalization step risks circular dependence, making the reported performance gain an artifact of data partitioning rather than a genuine benefit of adaptive budgeting.
- [Experimental setup] No information is provided on how the risk metric is computed from client data, whether it consumes privacy budget, or whether it correlates with data properties that independently affect model utility. These omissions prevent isolation of the personalization effect from confounding factors.
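For reference, the two regression metrics that the first comment names as examples have the standard definitions below, where y_i is the true value, \hat{y}_i the model prediction, and n the number of evaluation records. The notation is an assumption made here, and this summary does not confirm that these are the two metrics the paper reports.

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2,
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left\lvert y_i - \hat{y}_i \right\rvert
```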
minor comments (2)
- [Client drift section] The formal definition of client drift is introduced but its integration with the DP workflow and any empirical validation of the detection method against poisoning attacks are not detailed.
- Notation for privacy budgets (ε values) and risk metric should be made consistent across the methodology and experimental sections to aid readability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We agree that several clarifications and additions are needed to strengthen the experimental claims and address potential concerns about the re-identification risk metric. We will revise the manuscript accordingly and respond point-by-point to each major comment below.
- Referee: [Abstract and Experimental section] The claim of superior performance on two error metrics versus fixed-budget global DP lacks any reported dataset size, exact metric definitions (e.g., MSE, MAE, or classification error), statistical significance tests, baseline implementation details, or description of how the re-identification risk metric is calculated and applied. Without these, the headline result cannot be verified or reproduced.
  Authors: We agree that the abstract and experimental section require additional details for reproducibility. In the revised manuscript we will report the exact dataset size, provide precise definitions of the two error metrics (including whether they are MSE, MAE, or classification error), include statistical significance testing (e.g., paired t-tests or Wilcoxon tests with p-values), describe the fixed-budget baseline implementation in full, and add an explicit description of how the re-identification risk metric is calculated and mapped to per-client privacy budgets. These additions will make the performance claims verifiable.
  Revision: yes
- Referee: [Section on personalized DP budget assignment] The re-identification risk metric used to derive per-client budgets must be demonstrated to be independent of the model parameters and training data itself. If the metric is derived from the same client data used for model training, the personalization step risks circular dependence, making the reported performance gain an artifact of data partitioning rather than a genuine benefit of adaptive budgeting.
  Authors: This concern is well-taken. The risk metric is computed from local client data statistics (e.g., quasi-identifier uniqueness measures) in a pre-training step that does not reference model parameters or the global training process. To eliminate any ambiguity, the revised manuscript will include a formal argument and/or proof in the personalized DP section demonstrating that the metric is independent of both the model parameters and the training data used for learning, thereby ruling out circular dependence.
  Revision: yes
- Referee: [Experimental setup] No information is provided on how the risk metric is computed from client data, whether it consumes privacy budget, or whether it correlates with data properties that independently affect model utility. These omissions prevent isolation of the personalization effect from confounding factors.
  Authors: We accept that the experimental setup description is incomplete on these points. The revised version will specify the exact procedure for computing the risk metric from client data, explicitly state that the metric computation is a non-private pre-processing step that consumes no differential privacy budget, and add discussion plus controls (e.g., correlation analysis or ablation studies) to isolate the personalization effect from other data properties that may influence utility.
  Revision: yes
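The rebuttal mentions quasi-identifier uniqueness measures as one example of a local statistic behind the risk metric. A minimal sketch of such a measure follows: the fraction of a client's records whose combination of quasi-identifier values appears exactly once. The function name, the record layout (a list of dicts), and the column names are hypothetical, and this is not claimed to be the metric the paper uses.

```python
from collections import Counter


def uniqueness_risk(records, quasi_identifiers):
    """Fraction of records with a unique quasi-identifier combination in this client's data."""
    if not records:
        return 0.0
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(keys)


# Illustrative records (not from the paper's dataset).
client_records = [
    {"age": 30, "sex": "F", "diagnosis": "A"},
    {"age": 30, "sex": "F", "diagnosis": "B"},
    {"age": 45, "sex": "M", "diagnosis": "A"},
    {"age": 52, "sex": "F", "diagnosis": "C"},
]
risk = uniqueness_risk(client_records, ["age", "sex"])
```

Here two of the four records share the pair (30, "F"), so the risk is 0.5. Because the statistic is computed directly from client records without noise, it illustrates why the referee asks whether the metric consumes privacy budget.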
Circularity Check
No significant circularity detected; workflow is presented as empirical methodology
The paper outlines a federated learning workflow that combines anonymization, global differential privacy, client drift detection, and a methodology for assigning personalized privacy budgets via a re-identification risk metric. The central experimental claim is an empirical comparison showing improved error metrics under personalized budgets versus fixed budgets. The provided text exhibits no equations, definitions, or self-citations that reduce the risk metric computation, budget assignment, or performance gain to a definitional tautology or to a fitted input renamed as a prediction. The derivation chain is a sequence of standard privacy techniques applied to tabular medical data, with the personalization step treated as an independent methodological choice whose validity is tested experimentally rather than assumed by construction. Absent an explicit reduction (e.g., a risk metric defined from the training loss or from the reported performance metric), the result remains falsifiable against external benchmarks and does not meet the threshold for circularity.