arxiv: 2605.02372 · v1 · submitted 2026-05-04 · 💻 cs.CR · cs.AI

Privacy Preserving Machine Learning Workflow: from Anonymization to Personalized Differential Privacy Budgets in Federated Learning

Pith reviewed 2026-05-08 17:50 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords federated learning · differential privacy · anonymization · personalized privacy budgets · re-identification risk · client drift · medical records · privacy preserving machine learning

The pith

Personalized differential privacy budgets based on re-identification risk improve federated learning model performance over fixed budgets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents a complete workflow for privacy-preserving federated learning on sensitive tabular data that starts with anonymization and adds differential privacy controls. It introduces a formal definition of client drift along with detection techniques to reduce poisoning risks. The main contribution is a method for setting different privacy budgets for each client according to a re-identification risk metric computed from their local data. On an open medical records dataset the experiments show that this personalized-budget version produces lower error in two metrics than the version that applies one fixed budget to every client. A reader concerned with real-world deployment would care because the approach aims to keep privacy protections intact while recovering some of the accuracy that uniform privacy rules typically sacrifice.

Core claim

The paper claims that, within a federated learning network handling sensitive tabular records, allocating distinct differential privacy budgets to each client on the basis of a re-identification risk metric yields measurably better model performance than the conventional choice of a single global privacy budget applied uniformly across all clients.
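
To see why the choice of budget matters for utility, consider a Laplace mechanism applied to the contribution of client i under budget ε_i. This is an illustrative assumption, since the summary above does not state which noise mechanism the paper uses. The symbols Δ (sensitivity), η_i (the added noise), and b_i (the noise scale) are introduced only for this sketch.

```latex
\eta_i \sim \mathrm{Laplace}(0, b_i), \qquad
b_i = \frac{\Delta}{\varepsilon_i}, \qquad
\operatorname{Var}[\eta_i] = 2 b_i^{2} = \frac{2\Delta^{2}}{\varepsilon_i^{2}}
```

Under this assumption, halving ε_i quadruples the noise variance. A single strict budget therefore charges every client the cost of the most cautious setting, which is the penalty that per-client budgets aim to avoid.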

What carries the argument

The client-specific privacy budget assignment procedure that converts each participant's re-identification risk score into an individual differential privacy parameter.
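
The conversion from risk score to budget might look roughly like the sketch below. The linear rule, the direction (higher risk gives a smaller ε), the bounds `eps_min` and `eps_max`, and the client names are all hypothetical choices made for illustration; the paper's actual mapping is not given in this summary.

```python
def risk_to_epsilon(risk: float, eps_min: float = 0.5, eps_max: float = 5.0) -> float:
    """Map a re-identification risk score in [0, 1] to a privacy budget.

    Hypothetical rule: linear interpolation, so the highest-risk client
    receives the strictest (smallest) epsilon.
    """
    if not 0.0 <= risk <= 1.0:
        raise ValueError("risk must lie in [0, 1]")
    if not 0.0 < eps_min <= eps_max:
        raise ValueError("need 0 < eps_min <= eps_max")
    return eps_max - risk * (eps_max - eps_min)


def assign_budgets(client_risks: dict, **bounds) -> dict:
    """Apply the mapping to every client in a {client_id: risk} dictionary."""
    return {cid: risk_to_epsilon(r, **bounds) for cid, r in client_risks.items()}


# Illustrative client names and risk scores, not values from the paper.
budgets = assign_budgets({"hospital_a": 0.0, "hospital_b": 0.5, "hospital_c": 1.0})
print(budgets)  # {'hospital_a': 5.0, 'hospital_b': 2.75, 'hospital_c': 0.5}
```

Any monotone mapping would serve the same purpose. The linear form is used here only because it makes the dependence on the two bounds easy to read.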

If this is right

  • The workflow integrates anonymization, drift detection, and personalized budgets into one end-to-end process for tabular sensitive data.
  • Client drift detection supplies a concrete mechanism for spotting and limiting poisoning attempts during federation rounds.
  • Personalized budgets allow the network to avoid the uniform performance penalty that a single strict privacy level imposes on every participant.
  • The experimental comparison on medical records supplies direct evidence that utility can be improved while the overall privacy accounting stays within the differential privacy framework.
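
As a rough illustration of the drift-detection bullet above, the sketch below flags clients whose update lies unusually far from a robust center of all updates. This is a generic distance-based heuristic assumed for illustration, not the paper's formal definition of client drift or its Equation 2. The function names, the `factor` threshold, and the sample updates are hypothetical.

```python
import math
from statistics import median


def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def flag_drifting_clients(updates, factor=3.0):
    """Return the sorted ids of clients whose update is far from the others.

    updates: {client_id: list of floats}, all lists of the same length.
    A client is flagged when its distance to the coordinate-wise median
    exceeds `factor` times the median of all such distances.
    """
    center = [median(column) for column in zip(*updates.values())]
    distances = {cid: l2_distance(u, center) for cid, u in updates.items()}
    threshold = factor * median(distances.values())
    return sorted(cid for cid, d in distances.items() if d > threshold)


# Made-up updates: three similar clients and one outlier.
example_updates = {
    "a": [1.0, 1.0],
    "b": [1.1, 0.9],
    "c": [0.9, 1.1],
    "d": [10.0, -8.0],
}
flagged = flag_drifting_clients(example_updates)
print(flagged)  # ['d']
```

The coordinate-wise median is used as the center so that one extreme update cannot pull the reference point toward itself, which a plain mean would allow.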

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the risk metric can be kept stable across training rounds, the same logic could support budgets that adapt as new data arrives at each client.
  • The same personalization principle might apply outside medicine once comparable risk metrics are defined for other tabular domains such as finance or census data.
  • Making the risk metric itself differentially private would remove a possible information leak while still allowing the performance gains to be tested.
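
The last extension, releasing the risk metric under differential privacy, could be sketched with the Laplace mechanism as below. The function names are hypothetical, and the `sensitivity` argument must be supplied by the caller from an analysis of the specific risk metric, which this summary does not provide. The sketch is also not hardened against floating-point attacks on naive Laplace sampling.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_risk(risk, sensitivity, epsilon, rng=None):
    """Release a risk score with epsilon-DP via the Laplace mechanism.

    sensitivity: maximum change of the score when one record changes
                 (an assumption the caller has to justify for their metric).
    """
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    if rng is None:
        rng = random.Random()
    noisy = risk + laplace_noise(sensitivity / epsilon, rng)
    # Clamping to [0, 1] is post-processing and does not weaken the guarantee.
    return min(1.0, max(0.0, noisy))


# Illustrative call with made-up values and a fixed seed for repeatability.
noisy_score = private_risk(0.4, sensitivity=0.02, epsilon=1.0, rng=random.Random(0))
```

The budget spent on this release would have to be added to the client's overall privacy accounting.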

Load-bearing premise

A re-identification risk metric can be computed reliably from each client's own data and turned into a privacy budget without introducing bias, circular dependence on the training process, or fresh attack surfaces.
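
One simple way to compute such a metric from local data is the share of records that are unique on their quasi-identifiers. The sketch below assumes that definition for illustration; the paper's own metric is not specified in this summary. The function name, column names, and sample records are hypothetical.

```python
from collections import Counter


def reidentification_risk(records, quasi_identifiers):
    """Fraction of records that are unique on their quasi-identifier values.

    records: list of dicts, one per row of the client's local table.
    quasi_identifiers: column names that an attacker could link on.
    """
    if not records:
        raise ValueError("records must not be empty")
    keys = [tuple(row[q] for q in quasi_identifiers) for row in records]
    class_sizes = Counter(keys)  # size of each equivalence class
    unique = sum(1 for key in keys if class_sizes[key] == 1)
    return unique / len(records)


# Made-up rows: two share (age_band, region), two are unique.
sample = [
    {"age_band": "30-39", "region": "north", "diagnosis": "A"},
    {"age_band": "30-39", "region": "north", "diagnosis": "B"},
    {"age_band": "40-49", "region": "south", "diagnosis": "A"},
    {"age_band": "50-59", "region": "east", "diagnosis": "C"},
]
risk = reidentification_risk(sample, ["age_band", "region"])
print(risk)  # 0.5
```

A score computed this way depends only on the client's table, not on model parameters. It still reads the same records used for training, which is why the premise above is load-bearing.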

What would settle it

Re-running the medical-records experiments and finding that the personalized-budget models do not produce lower error than the fixed-budget models on both reported metrics.

Figures

Figures reproduced from arXiv: 2605.02372 by Álvaro López García, Judith Sáinz-Pardo Díaz.

Figure 1: Divergence between clients calculated using Equation 2.

Original abstract

The growing development of artificial intelligence based solutions, together with privacy legislation, has driven the rise of the so-called privacy preserving machine learning architectures, such as federated learning. While federated learning enables model training on decentralized data preventing their sharing and centralization, it still faces several challenges related to data integrity and privacy. This paper presents a comprehensive privacy preserving federated learning workflow for sensitive tabular data, including anonymization and differential privacy techniques. We also introduce a formal definition for the concept of client drift, together with ways of detecting it to mitigate poisoning attacks. Then, we detail a complete methodology for assigning personalized privacy budgets for global differential privacy to the different clients participating in the network, based on a re-identification risk metric. The proposed methodology is presented and tested on an openly available dataset of medical records. Within the experimental setup we show that the approach based on personalized budgets, compared to the architecture including global differential privacy with fixed privacy budget, achieves a better model performance in terms of two error metrics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes a comprehensive privacy-preserving federated learning workflow for sensitive tabular data. It combines anonymization techniques with global differential privacy, introduces a formal definition of client drift along with detection methods to mitigate poisoning attacks, and details a methodology for assigning personalized differential privacy budgets to clients based on a per-client re-identification risk metric. The approach is evaluated on an openly available medical records dataset, with the claim that personalized budgets yield better model performance than a fixed-budget global DP baseline according to two error metrics.

Significance. If the re-identification risk metric is shown to be computed independently of the training data and model parameters without introducing bias or new attack surfaces, the personalized budget assignment could meaningfully improve the utility-privacy trade-off in heterogeneous federated settings such as healthcare. The client drift formalization and detection mechanism would add a useful robustness component against poisoning. However, the current experimental evidence is insufficient to establish these benefits.

major comments (3)
  1. [Abstract and Experimental section] The claim of superior performance on two error metrics versus fixed-budget global DP lacks any reported dataset size, exact metric definitions (e.g., MSE, MAE, or classification error), statistical significance tests, baseline implementation details, or description of how the re-identification risk metric is calculated and applied. Without these, the headline result cannot be verified or reproduced.
  2. [Section on personalized DP budget assignment] The re-identification risk metric used to derive per-client budgets must be demonstrated to be independent of the model parameters and training data itself. If the metric is derived from the same client data used for model training, the personalization step risks circular dependence, making the reported performance gain an artifact of data partitioning rather than a genuine benefit of adaptive budgeting.
  3. [Experimental setup] No information is provided on how the risk metric is computed from client data, whether it consumes privacy budget, or whether it correlates with data properties that independently affect model utility. These omissions prevent isolation of the personalization effect from confounding factors.
minor comments (2)
  1. [Client drift section] The formal definition of client drift is introduced but its integration with the DP workflow and any empirical validation of the detection method against poisoning attacks are not detailed.
  2. Notation for privacy budgets (ε values) and risk metric should be made consistent across the methodology and experimental sections to aid readability.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that several clarifications and additions are needed to strengthen the experimental claims and address potential concerns about the re-identification risk metric. We will revise the manuscript accordingly and respond point-by-point to each major comment below.

Point-by-point responses
  1. Referee: [Abstract and Experimental section] The claim of superior performance on two error metrics versus fixed-budget global DP lacks any reported dataset size, exact metric definitions (e.g., MSE, MAE, or classification error), statistical significance tests, baseline implementation details, or description of how the re-identification risk metric is calculated and applied. Without these, the headline result cannot be verified or reproduced.

    Authors: We agree that the abstract and experimental section require additional details for reproducibility. In the revised manuscript we will report the exact dataset size, provide precise definitions of the two error metrics (including whether they are MSE, MAE, or classification error), include statistical significance testing (e.g., paired t-tests or Wilcoxon tests with p-values), describe the fixed-budget baseline implementation in full, and add an explicit description of how the re-identification risk metric is calculated and mapped to per-client privacy budgets. These additions will make the performance claims verifiable. revision: yes

  2. Referee: [Section on personalized DP budget assignment] The re-identification risk metric used to derive per-client budgets must be demonstrated to be independent of the model parameters and training data itself. If the metric is derived from the same client data used for model training, the personalization step risks circular dependence, making the reported performance gain an artifact of data partitioning rather than a genuine benefit of adaptive budgeting.

    Authors: This concern is well-taken. The risk metric is computed from local client data statistics (e.g., quasi-identifier uniqueness measures) in a pre-training step that does not reference model parameters or the global training process. To eliminate any ambiguity, the revised manuscript will include a formal argument and/or proof in the personalized DP section demonstrating that the metric is independent of both the model parameters and the training data used for learning, thereby ruling out circular dependence. revision: yes

  3. Referee: [Experimental setup] No information is provided on how the risk metric is computed from client data, whether it consumes privacy budget, or whether it correlates with data properties that independently affect model utility. These omissions prevent isolation of the personalization effect from confounding factors.

    Authors: We accept that the experimental setup description is incomplete on these points. The revised version will specify the exact procedure for computing the risk metric from client data, explicitly state that the metric computation is a non-private pre-processing step that consumes no differential privacy budget, and add discussion plus controls (e.g., correlation analysis or ablation studies) to isolate the personalization effect from other data properties that may influence utility. revision: yes
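
The significance testing mentioned in the first response could take several forms. The sketch below uses an exact paired sign-flip permutation test, chosen here because it needs only the standard library; it is a stand-in for, not an implementation of, the paired t-test or Wilcoxon test named in the response. The function name is hypothetical, and the inputs are assumed to be per-run errors of the two configurations measured on matched runs.

```python
from itertools import product


def paired_sign_flip_pvalue(errors_a, errors_b):
    """Two-sided exact sign-flip permutation test on paired differences.

    Under the null hypothesis that both configurations perform alike, the
    sign of each paired difference is arbitrary, so every sign pattern is
    enumerated. The cost grows as 2**n, so this suits small n only.
    """
    if len(errors_a) != len(errors_b) or not errors_a:
        raise ValueError("need two non-empty sequences of equal length")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        statistic = abs(sum(s * d for s, d in zip(signs, diffs)))
        if statistic >= observed - 1e-12:  # tolerance for float rounding
            extreme += 1
    return extreme / total


# Placeholder numbers chosen for easy checking, not results from the paper.
p_value = paired_sign_flip_pvalue([1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
print(p_value)  # 0.25
```

With only three pairs the smallest attainable two-sided p-value is 0.25, which shows why the number of repeated runs would need to be reported alongside any such test.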

Circularity Check

0 steps flagged

No significant circularity detected; workflow is presented as empirical methodology

Full rationale

The paper outlines a federated learning workflow that combines anonymization, global differential privacy, client drift detection, and a methodology for assigning personalized privacy budgets via a re-identification risk metric. The central experimental claim is an empirical comparison showing improved error metrics under personalized budgets versus fixed budgets. No equations, definitions, or self-citations are exhibited in the provided text that reduce the risk metric computation, budget assignment, or performance gain to a definitional tautology or fitted input renamed as prediction. The derivation chain is a sequence of standard privacy techniques applied to tabular medical data, with the personalization step treated as an independent methodological choice whose validity is tested experimentally rather than assumed by construction. Absent explicit reduction (e.g., metric defined from training loss or performance metric), the result remains falsifiable against external benchmarks and does not meet the threshold for circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract alone does not expose explicit free parameters, axioms, or invented entities; the re-identification risk metric and client drift definition are introduced but their foundational assumptions and computation details are not stated.
