PSyGenTAB: A Privacy-Preserving Framework for Synthetic Clinical Tabular Data Generation via Constrained Optimization

Amir Rahmani; Arshia Ilaty; Dhanalakshmi Ramesh; Hajar Homayouni; Hossein Shirazi; Kedar Hegde; Manasi Chitale; Rashmi S. Manjunath

arxiv: 2606.18518 · v1 · pith:PUTEDX27new · submitted 2026-06-16 · 💻 cs.LG · cs.AI

Arshia Ilaty, Hossein Shirazi, Manasi Chitale, Kedar Hegde, Dhanalakshmi Ramesh, Rashmi S. Manjunath, Amir Rahmani, Hajar Homayouni

Pith reviewed 2026-06-27 00:58 UTC · model grok-4.3

keywords synthetic clinical data · privacy preservation · constrained optimization · Augmented Lagrangian Method · health AI · membership inference · tabular data generation

The pith

PSyGenTAB generates synthetic clinical tabular data by solving a constrained optimization problem that embeds privacy thresholds directly into training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a generative framework that treats synthetic healthcare data creation as a constrained optimization task. It solves this task with the Augmented Lagrangian Method so that minimum privacy requirements are enforced while clinical relationships and minority diagnostic patterns are retained. Downstream tests show that machine learning models trained on the resulting synthetic records reach performance levels close to those trained on real patient data under both Train-on-Synthetic/Test-on-Real and Train-on-Real/Test-on-Synthetic protocols. Privacy audits indicate lower rates of exact record reproduction and greater resistance to membership inference attacks than prior methods. If these results hold, institutions could share synthetic versions of clinical data for collaborative AI development without violating privacy regulations.

Core claim

By casting synthetic data generation as a constrained optimization problem and solving it with the Augmented Lagrangian Method, PSyGenTAB directly incorporates configurable privacy constraints into the training objective. This produces tabular records that preserve inter-feature clinical correlations and minority-class diagnostic patterns while meeting explicit privacy thresholds, yielding downstream model performance comparable to real data and reduced vulnerability to re-identification.

What carries the argument

Formulating synthetic clinical tabular data generation as a constrained optimization problem solved via the Augmented Lagrangian Method, with privacy constraints embedded in the training loop.
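
For orientation, the textbook augmented Lagrangian for a single inequality constraint can be written as below. This is the generic form from the optimization literature, not the paper's own formulation, which the abstract does not give. The symbols $f$ (a utility loss), $g$ (a privacy constraint, satisfied when $g \le 0$), $\lambda$ (the multiplier), and $\rho$ (the penalty weight) are assumptions introduced here for illustration.

```latex
\begin{aligned}
&\min_{\theta}\; f(\theta) \quad \text{subject to} \quad g(\theta) \le 0, \\
&\mathcal{L}_{\rho}(\theta,\lambda) = f(\theta)
  + \frac{1}{2\rho}\Big(\max\{0,\ \lambda + \rho\, g(\theta)\}^{2} - \lambda^{2}\Big), \\
&\lambda \leftarrow \max\{0,\ \lambda + \rho\, g(\theta)\}.
\end{aligned}
```

Under this template, a privacy threshold of the form "risk at most $\tau$" would correspond to $g(\theta) = \text{risk}(\theta) - \tau$.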

If this is right

Models trained on the synthetic data achieve performance comparable to real-data models on both Train-on-Synthetic/Test-on-Real and Train-on-Real/Test-on-Synthetic evaluations.
Generated records show reduced exact reproduction of original patient entries.
The framework demonstrates stronger resistance to membership inference attacks than existing synthetic data methods.
Inter-feature clinical relationships and minority-class diagnostic patterns remain intact across multiple clinically motivated benchmarks.
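
The two evaluation protocols named in this list can be sketched on toy data. This is an illustrative outline only: the nearest-centroid classifier, the `make_table` generator, and its `shift` parameter are hypothetical stand-ins, not the paper's models, datasets, or benchmarks.

```python
import random

def fit_centroids(rows, labels):
    """Nearest-centroid 'model': the mean feature vector of each class."""
    sums, counts = {}, {}
    for row, y in zip(rows, labels):
        acc = sums.setdefault(y, [0.0] * len(row))
        for i, v in enumerate(row):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

def predict(centroids, row):
    """Return the class whose centroid is closest to `row`."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(row, c))
    return min(centroids, key=lambda y: sq_dist(centroids[y]))

def accuracy(centroids, rows, labels):
    hits = sum(predict(centroids, r) == y for r, y in zip(rows, labels))
    return hits / len(rows)

def make_table(n, seed, shift=0.0):
    """Toy two-class table (assumption); `shift` mimics imperfect synthetic data."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        centre = 3.0 * y + shift
        rows.append([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)])
        labels.append(y)
    return rows, labels

real_train = make_table(400, seed=0)
real_test = make_table(400, seed=1)
synthetic = make_table(400, seed=2, shift=0.2)  # stand-in for generator output

trtr = accuracy(fit_centroids(*real_train), *real_test)  # real-data baseline
tstr = accuracy(fit_centroids(*synthetic), *real_test)   # Train-on-Synthetic, Test-on-Real
trts = accuracy(fit_centroids(*real_train), *synthetic)  # Train-on-Real, Test-on-Synthetic
print(f"TRTR={trtr:.3f}  TSTR={tstr:.3f}  TRTS={trts:.3f}")
```

Reading "comparable performance" as a small gap between the TSTR score and the real-data baseline is one plausible interpretation; the abstract does not say how large a gap the authors consider acceptable.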

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same constrained-optimization structure could be adapted to generate synthetic data for non-clinical tabular domains that also require strict privacy controls.
Institutions could use the framework to create shareable synthetic cohorts that support multi-site model training without exchanging raw records.
If the privacy constraints prove robust under repeated attacks, regulators might accept synthetic data as a compliant alternative for certain model-development workflows.

Load-bearing premise

Embedding privacy constraints directly into model training through the Augmented Lagrangian Method will simultaneously satisfy minimum privacy thresholds and retain clinically meaningful patterns without later degradation of downstream utility.

What would settle it

A controlled experiment in which models trained on the synthetic data exhibit statistically significant drops in diagnostic accuracy on held-out real patient records, or in which membership inference attacks identify training-set members at rates above those reported for the real baseline.

Original abstract

The development of medical AI is constrained by limited access to high-quality clinical data due to institutional silos and strict privacy regulations such as HIPAA and GDPR. Synthetic data generation offers a potential solution, but existing methods lack principled mechanisms to explicitly manage the privacy-utility trade-off, often degrading clinically meaningful patterns or risking patient re-identification. We present PSyGenTAB, a privacy-preserving generative framework that formulates synthetic healthcare data generation as a constrained optimization problem solved using the Augmented Lagrangian Method. By embedding configurable privacy constraints directly into model training, PSyGenTAB enforces minimum privacy thresholds while maximizing clinical data utility. Across multiple clinically motivated benchmarks, PSyGenTAB preserves inter-feature clinical relationships and minority-class diagnostic patterns essential for reliable health AI. Downstream evaluation using Train-on-Synthetic, Test-on-Real and Train-on-Real, Test-on-Synthetic protocols shows that models trained on synthetic data achieve performance comparable to those trained on real patient records. Privacy auditing further demonstrates reduced exact record reproduction and strong resilience to membership inference attacks. These results establish PSyGenTAB as a principled framework for balancing privacy protection and clinical utility in synthetic healthcare data, supporting secure cross-institutional AI development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

PSyGenTAB frames synthetic clinical data as constrained optimization with ALM but supplies no equations or metrics showing the privacy constraints actually work.

The paper's main contribution is casting synthetic tabular clinical data generation as a constrained optimization problem solved by the Augmented Lagrangian Method, with privacy thresholds written directly into the training. That is a reasonable way to make the privacy-utility tradeoff explicit rather than post-hoc.

The approach is new in its emphasis on embedding configurable privacy constraints inside the optimizer. The abstract correctly identifies the right evaluation protocols (TSTR and TRTS) and flags the need to keep minority-class patterns intact, which matters in clinical settings.

The soft spot is the one flagged in the stress-test note. Membership inference resistance and exact-record non-reproduction are statistical properties measured after sampling; turning them into deterministic, differentiable constraints that ALM can enforce without degrading utility on rare diagnoses is not straightforward. The abstract gives no equations for those constraints, no description of how they are approximated inside the Lagrangian, and no numbers on dual convergence or threshold satisfaction. Without that, the claim that minimum privacy levels are met while clinical relationships are preserved remains unverified.

Benchmark details are also thin: no dataset sizes, no list of baselines, and no error bars or ablation on the privacy parameters. The downstream performance is described only as “comparable,” which is hard to interpret.

This work is for researchers already working on privacy-preserving synthetic data for health applications. A reader who knows the standard GAN or diffusion baselines would get the most from seeing whether the constrained formulation improves on them.

It deserves peer review because the topic is important and the optimization framing is a legitimate direction, even though the current version is light on the technical evidence needed to assess the central claim.

Referee Report

2 major / 1 minor

Summary. The paper introduces PSyGenTAB, a framework for generating synthetic clinical tabular data that formulates the task as a constrained optimization problem solved using the Augmented Lagrangian Method (ALM). By embedding privacy constraints directly into the training process, it aims to enforce minimum privacy thresholds while maximizing clinical utility. The authors claim that across clinically motivated benchmarks, the method preserves inter-feature relationships and minority-class patterns, with downstream models achieving performance comparable to those trained on real data under the Train-on-Synthetic/Test-on-Real (TSTR) and Train-on-Real/Test-on-Synthetic (TRTS) protocols, and improved privacy metrics against record reproduction and membership inference attacks.

Significance. If the claims hold, this work would offer a significant advancement in synthetic data generation for healthcare by providing a principled, optimization-based approach to the privacy-utility trade-off, potentially enabling more secure data sharing for AI development. The use of ALM for explicit constraint handling is a notable methodological choice that could generalize beyond the presented benchmarks.

major comments (2)

[Abstract] The central claim that ALM enforces minimum privacy thresholds (reduced exact record reproduction and resilience to membership inference) while preserving minority-class diagnostic patterns rests on translating distribution-dependent privacy notions into deterministic, differentiable constraints amenable to dual updates. The abstract provides no formulation details showing how these are expressed as functions of generated samples or their statistics, leaving open whether the method uses hard constraints or soft penalties that may fail to meet thresholds or degrade utility on minority classes.
[Abstract] The weakest assumption (embedding configurable privacy constraints via ALM to simultaneously enforce thresholds and preserve patterns) is load-bearing; if privacy metrics require post-hoc Monte-Carlo evaluation rather than direct constraint functions, the optimization may not deliver the claimed guarantees without additional verification steps not described in the abstract.
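
To make the referee's question concrete, here is one way a hard "exact copy" count could be relaxed into a smooth function of generated samples. It is a sketch under stated assumptions, not the paper's constraint: the sigmoid relaxation, the function names, and the values of `tau`, `temperature`, and `max_rate` are all hypothetical.

```python
import math

def nearest_distance(record, real_rows):
    """Euclidean distance from one synthetic record to its closest real record."""
    return min(math.dist(record, r) for r in real_rows)

def soft_copy_rate(synthetic_rows, real_rows, tau=0.5, temperature=0.05):
    """Smooth stand-in for the fraction of synthetic records within `tau` of a real one.

    sigmoid((tau - d) / temperature) approaches the hard indicator [d < tau]
    as the temperature shrinks, but stays differentiable in d.
    """
    total = 0.0
    for s in synthetic_rows:
        z = (tau - nearest_distance(s, real_rows)) / temperature
        total += 0.5 * (1.0 + math.tanh(z / 2.0))  # overflow-safe sigmoid(z)
    return total / len(synthetic_rows)

real = [[0.0, 0.0], [1.0, 1.0], [4.0, 2.0]]
copied = [[0.0, 0.0], [1.0, 1.01]]    # near-duplicates of real rows
distinct = [[2.5, 0.0], [2.5, 3.0]]   # far from every real row

max_rate = 0.1  # hypothetical privacy threshold
for name, rows in [("copied", copied), ("distinct", distinct)]:
    g = soft_copy_rate(rows, real) - max_rate  # constraint satisfied when g <= 0
    print(name, "violated" if g > 0 else "satisfied")
```

A relaxation like this is differentiable, but it only approximates the audited quantity, which is the gap the referee is pointing at: meeting the smooth constraint during training does not by itself certify the post-hoc metric.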

minor comments (1)

[Abstract] The abstract supplies no implementation details, benchmark definitions, quantitative metrics, or error analysis, making it impossible to verify the optimization's support for the stated claims from the provided text alone.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on the abstract. The concerns highlight the need for clearer high-level formulation details in the abstract itself. We address each point below and will revise the abstract accordingly while preserving its brevity.

Referee: [Abstract] The central claim that ALM enforces minimum privacy thresholds (reduced exact record reproduction and resilience to membership inference) while preserving minority-class diagnostic patterns rests on translating distribution-dependent privacy notions into deterministic, differentiable constraints amenable to dual updates. The abstract provides no formulation details showing how these are expressed as functions of generated samples or their statistics, leaving open whether the method uses hard constraints or soft penalties that may fail to meet thresholds or degrade utility on minority classes.

Authors: We agree that the abstract, as a high-level summary, does not include the explicit functional forms. In the full manuscript (Section 3), privacy constraints are expressed as differentiable functions of generated samples: exact record reproduction is penalized via a soft indicator based on Euclidean distance thresholds to real records, and membership inference resilience uses a differentiable approximation of attack success rate via logistic loss on sample statistics. These enter the ALM as inequality constraints with dual variable updates, functioning as soft penalties that asymptotically enforce thresholds. Minority-class patterns are preserved via separate utility constraints on class-conditional statistics. We will revise the abstract to briefly note that privacy notions are translated into differentiable sample-based functions solved via ALM. revision: yes
Referee: [Abstract] The weakest assumption (embedding configurable privacy constraints via ALM to simultaneously enforce thresholds and preserve patterns) is load-bearing; if privacy metrics require post-hoc Monte-Carlo evaluation rather than direct constraint functions, the optimization may not deliver the claimed guarantees without additional verification steps not described in the abstract.

Authors: The optimization uses direct, differentiable constraint functions (as detailed in Section 3) rather than post-hoc Monte-Carlo; the latter is reserved exclusively for final auditing in the experiments. The ALM dual updates operate on the embedded functions to enforce thresholds during training. We acknowledge the abstract does not distinguish this, which could lead to the noted ambiguity. We will revise the abstract to clarify that constraints are direct and differentiable (with post-hoc evaluation used only for reporting). revision: yes
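
The dual-update mechanism these responses describe can be illustrated on a one-dimensional toy problem. The sketch below is not the paper's training loop: the objective `(x - 2)**2`, the constraint `x - 1 <= 0`, and the values of `rho` and `lr` are assumptions chosen so the behaviour is easy to check by hand (the constrained optimum is x = 1 with multiplier 2).

```python
def augmented_lagrangian_grad(x, lam, rho):
    """Gradient in x of f(x) + (1/(2*rho)) * (max(0, lam + rho*g(x))**2 - lam**2)
    for the toy problem f(x) = (x - 2)**2, g(x) = x - 1 (feasible when g <= 0)."""
    grad_f = 2.0 * (x - 2.0)
    active = max(0.0, lam + rho * (x - 1.0))  # zero while the constraint is slack
    return grad_f + active                    # dg/dx = 1 for this g

def solve(rho=10.0, outer_steps=30, inner_steps=200, lr=0.05):
    x, lam = 0.0, 0.0
    for _ in range(outer_steps):
        # Primal step: approximately minimise the augmented Lagrangian in x.
        for _ in range(inner_steps):
            x -= lr * augmented_lagrangian_grad(x, lam, rho)
        # Dual step: raise the multiplier while the constraint is violated.
        lam = max(0.0, lam + rho * (x - 1.0))
    return x, lam

x_star, lam_star = solve()
violation = max(0.0, x_star - 1.0)
print(f"x = {x_star:.4f}, lambda = {lam_star:.4f}, violation = {violation:.2e}")
```

In this convex toy case the violation shrinks toward zero as the multiplier converges. Whether the same holds for a nonconvex generator trained with stochastic gradients is exactly what dual-convergence and threshold-satisfaction numbers would need to show.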

Circularity Check

0 steps flagged

No circularity detected; formulation is an independent modeling choice

The paper frames synthetic data generation as a constrained optimization problem solved via the standard Augmented Lagrangian Method, with privacy constraints embedded as configurable terms. This is an explicit modeling decision rather than a derivation that reduces to fitted inputs or self-citations. Downstream Train-on-Synthetic/Test-on-Real evaluations and privacy audits are presented as separate empirical checks, not forced by construction. No equations, self-citation chains, or renamings appear in the abstract that would create circularity. The approach is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only access prevents identification of specific free parameters, axioms, or invented entities. The approach appears to rest on standard constrained optimization techniques whose details are not supplied.

Reference graph

Works this paper leans on

74 extracted references · 10 canonical work pages · 4 internal anchors

[1]

Health insurance portability and accountability act of 1996 (hipaa)

U. Congress, “Health insurance portability and accountability act of 1996 (hipaa).” https://www.hhs.gov/hipaa/index.html, 1996. Accessed May 2025

1996
[2]

Regulation (eu) 2016/679 of the european parliament and of the council (general data protection regulation)

E. Union, “Regulation (eu) 2016/679 of the european parliament and of the council (general data protection regulation).” https://gdpr-info.eu,

2016
[3]

Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record,

J. Walonoski, M. Kramer, J. Nichols, A. Quina, C. Moesel, D. Hall, C. Duffett, K. Dube, T. Gallagher, and S. McLachlan, “Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record,”Journal of the American Medical Informatics Association, vol. 0, pp. 1–9, 09 2017

2017
[4]

Gen- erating multi-label discrete patient records using generative adversarial networks,

E. Choi, S. Biswal, B. Malin, J. Duke, W. F. Stewart, and J. Sun, “Gen- erating multi-label discrete patient records using generative adversarial networks,”Machine Learning for Healthcare Conference, pp. 286–305, 2017

2017
[5]

Deep learning with differential privacy,

M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang, “Deep learning with differential privacy,” in Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pp. 308–318, 2016

2016
[6]

Generative adversarial nets,

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y . Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, vol. 27, 2014

2014
[7]

Gen- erating multi-label discrete patient records using generative adversarial networks,

E. Choi, S. Biswal, B. Malin, J. Duke, W. F. Stewart, and J. Sun, “Gen- erating multi-label discrete patient records using generative adversarial networks,” inProceedings of Machine Learning for Healthcare (MLHC), vol. 68 ofProceedings of Machine Learning Research (PMLR), pp. 286– 305, PMLR, 2017

2017
[8]

Auto-Encoding Variational Bayes

D. P. Kingma and M. Welling, “Auto-encoding variational bayes,”arXiv preprint arXiv:1312.6114, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[9]

Privacy-preserving synthetic med- ical data generation using variational autoencoders,

S. Dash, O. G ¨unl¨uk, and D. Wei, “Privacy-preserving synthetic med- ical data generation using variational autoencoders,”arXiv preprint arXiv:2012.15328, 2020

work page arXiv 2012
[10]

Synthesizing Tabular Data using Generative Adversarial Networks

L. Xu and K. Veeramachaneni, “Synthesizing tabular data using gener- ative adversarial networks,”arXiv preprint arXiv:1811.11264, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[11]

Modeling tabular data using conditional gan,

L. Xu, M. Skoularidou, A. Cuesta-Infante, and K. Veeramachaneni, “Modeling tabular data using conditional gan,”Advances in neural information processing systems, vol. 32, 2019

2019
[12]

Ctab-gan+: Enhancing tabular data synthesis,

Z. Zhao, A. Kunar, R. Birke, and L. Y . Chen, “Ctab-gan+: Enhancing tabular data synthesis,” 2022

2022
[13]

Realtabformer: Generating real- istic relational and tabular data using transformers,

A. V . Solatorio and O. Dupriez, “Realtabformer: Generating real- istic relational and tabular data using transformers,”arXiv preprint arXiv:2302.02041, 2023

work page arXiv 2023
[14]

Language models are realistic tabular data generators,

V . Borisov, K. Seßler, T. Leemann, M. Pawelczyk, and G. Kasneci, “Language models are realistic tabular data generators,”arXiv preprint arXiv:2210.06280, 2022

work page arXiv 2022
[15]

Tabddpm: Modelling tabular data with diffusion models,

A. Kotelnikov, D. Baranchuk, I. Rubachev, and A. Babenko, “Tabddpm: Modelling tabular data with diffusion models,” inInternational Confer- ence on Machine Learning, pp. 17564–17579, PMLR, 2023

2023
[16]

Mixed-type tabular data synthesis with score-based diffusion in latent space,

H. Zhang, J. Zhang, B. Srinivasan, Z. Shen, X. Qin, C. Faloutsos, H. Rangwala, and G. Karypis, “Mixed-type tabular data synthesis with score-based diffusion in latent space,” inThe twelfth International Conference on Learning Representations, 2024

2024
[17]

Membership inference attacks against machine learning models,

R. Shokri, M. Stronati, C. Song, and V . Shmatikov, “Membership inference attacks against machine learning models,” in2017 IEEE Symposium on Security and Privacy (SP), pp. 3–18, IEEE, 2017

2017
[18]

Syn- thetic data generation for tabular health records: A systematic review,

M. Hernandez, G. Epelde, A. Alberdi, R. Cilla, and D. Rankin, “Syn- thetic data generation for tabular health records: A systematic review,” Neurocomputing, vol. 493, pp. 28–45, 2022

2022
[19]

Preserving privacy in healthcare: A systematic review of deep learning approaches for synthetic data generation,

Y . Liu, U. R. Acharya, and J. H. Tan, “Preserving privacy in healthcare: A systematic review of deep learning approaches for synthetic data generation,”Computer Methods and Programs in Biomedicine, vol. 260, p. 108571, 2025

2025
[20]

k-anonymity: A model for protecting privacy,

L. Sweeney, “k-anonymity: A model for protecting privacy,”Interna- tional Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, no. 05, pp. 557–570, 2002

2002
[21]

l-diversity: Privacy beyond k-anonymity,

A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam, “l-diversity: Privacy beyond k-anonymity,”ACM Transactions on Knowl- edge Discovery from Data (TKDD), vol. 1, no. 1, pp. 3–es, 2007

2007
[22]

t-closeness: Privacy beyond k- anonymity and l-diversity,

N. Li, T. Li, and S. Venkatasubramanian, “t-closeness: Privacy beyond k- anonymity and l-diversity,” in2007 IEEE 23rd International Conference on Data Engineering, pp. 106–115, IEEE, 2007

2007
[23]

The future of digital health with federated learning,

N. Rieke, J. Hancox, W. Li, F. Milletari, H. R. Roth, S. Albarqouni, S. Bakas, M. N. Galtier, B. A. Landman, K. Maier-Hein, S. Herrmann, J. Shotton, J. Trees, B. Kainz, R. Cobb, B. Glocker, and D. Rueckert, “The future of digital health with federated learning,”npj Digital Medicine, vol. 3, no. 1, p. 119, 2020

2020
[24]

Federated learning for healthcare: Systematic review and architecture proposal,

J. Park, J. Yoon, S. Keum, J. Oh, M. Lee, and J.-W. Kim, “Federated learning for healthcare: Systematic review and architecture proposal,” IEEE Journal of Biomedical and Health Informatics, vol. 25, no. 5, pp. 1478–1491, 2021

2021
[25]

Deep leakage from gradients,

L. Zhu, Z. Liu, and S. Han, “Deep leakage from gradients,” inAd- vances in Neural Information Processing Systems (NeurIPS), vol. 32, pp. 14774–14784, 2019

2019
[26]

Differential privacy,

C. Dwork, “Differential privacy,” inAutomata, Languages and Program- ming(M. Bugliesi, B. Preneel, V . Sassone, and I. Wegener, eds.), (Berlin, Heidelberg), pp. 1–12, Springer Berlin Heidelberg, 2006

2006
[27]

The algorithmic foundations of differential privacy,

C. Dwork and A. Roth, “The algorithmic foundations of differential privacy,”Foundations and Trends in Theoretical Computer Science, vol. 9, no. 3-4, pp. 211–407, 2014

2014
[28]

Dp-gan: Differentially private consecutive data publishing using generative adversarial nets,

S. Ho, Y . Qu, B. Gu, L. Gao, J. Li, and Y . Xiang, “Dp-gan: Differentially private consecutive data publishing using generative adversarial nets,” Journal of Network and Computer Applications, vol. 185, p. 103066, 2021

2021
[29]

Dp-ctgan: Differentially pri- vate medical data generation using ctgans,

M. L. Fang, D. S. Dhami, and K. Kersting, “Dp-ctgan: Differentially pri- vate medical data generation using ctgans,” inInternational Conference on Artificial Intelligence in Medicine, pp. 178–188, Springer, 2022

2022
[30]

Pate-gan: Generating synthetic data with differential privacy guarantees,

J. Jordon, J. Yoon, and M. Van Der Schaar, “Pate-gan: Generating synthetic data with differential privacy guarantees,” inInternational conference on learning representations, 2018

2018
[31]

Membership inference attacks from first principles,

N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert-V oss, K. Lee, A. Roberts, T. Brown, D. Song, ´U. Erlingsson,et al., “Membership inference attacks from first principles,” in2022 IEEE Symposium on Security and Privacy (SP), pp. 1897–1914, IEEE, 2021

1914
[32]

Synthetic data — what, why and how?,

J. Jordon, L. Szpruch, F. Houssiau, M. Bottarelli, G. Cherubin, C. Maple, S. N. Cohen, and A. Weller, “Synthetic data — what, why and how?,” arXiv preprint arXiv:2205.03257, 2022

work page arXiv 2022
[33]

A comprehensive evaluation frame- work for synthetic medical tabular data generation,

A. Kurakova and H. Homayouni, “A comprehensive evaluation frame- work for synthetic medical tabular data generation,”ACM Transactions on Computing for Healthcare, 2024. Submitted for publication

2024
[34]

SynLLM: A Comparative Analysis of Large Language Models for Medical Tabular Synthetic Data Generation via Prompt Engineering,

A. Ilaty, H. Shirazi, and H. Homayouni, “SynLLM: A Comparative Analysis of Large Language Models for Medical Tabular Synthetic Data Generation via Prompt Engineering,” 2025. 10 Pages, 2 Supplementary Pages, 6 Tables

2025
[35]

Syntheval: a framework for detailed utility and privacy evaluation of tabular synthetic data,

A. D. Lautrup, T. Hyrup, A. Zimek, and P. Schneider-Kamp, “Syntheval: a framework for detailed utility and privacy evaluation of tabular synthetic data,”Data Mining and Knowledge Discovery, vol. 39, Dec. 2024

2024
[36]

Multiplier and gradient methods,

M. R. Hestenes, “Multiplier and gradient methods,”Journal of Opti- mization Theory and Applications, vol. 4, no. 5, pp. 303–320, 1969

1969
[37]

A method for nonlinear constraints in minimization problems,

M. J. D. Powell, “A method for nonlinear constraints in minimization problems,”Optimization, pp. 283–298, 1969. 14

1969
[38]

The multiplier method of Hestenes and Powell applied to convex programming,

R. T. Rockafellar, “The multiplier method of Hestenes and Powell applied to convex programming,”Journal of Optimization Theory and Applications, vol. 12, no. 6, pp. 555–562, 1973

1973
[39]

D. P. Bertsekas,Constrained Optimization and Lagrange Multiplier Methods. New York, NY , USA: Academic Press, 2014

2014
[40]

Stochastic inexact augmented lagrangian method for nonconvex expectation constrained optimization,

Z. Li, P.-Y . Chen, S. Liu, S. Lu, and Y . Xu, “Stochastic inexact augmented lagrangian method for nonconvex expectation constrained optimization,”Computational Optimization and Applications, vol. 87, no. 1, pp. 117–147, 2024

2024
[41]

Learning constrained optimization with deep augmented lagrangian methods,

J. Kotary and F. Fioretto, “Learning constrained optimization with deep augmented lagrangian methods,” 2024

2024
[42]

Two-Player Games for Efficient Non-Convex Constrained Optimization

A. Cotter, H. Jiang, and K. Sridharan, “Two-player games for efficient non-convex constrained optimization,”arXiv preprint arXiv:1804.06500, 2019

work page internal anchor Pith review Pith/arXiv arXiv 2019
[43]

A multidimensional version of the kolmogorov–smirnov test,

G. Fasano and A. Franceschini, “A multidimensional version of the kolmogorov–smirnov test,”Monthly Notices of the Royal Astronomical Society, vol. 225, no. 1, pp. 155–170, 1987

1987
[44]

On information and sufficiency,

S. Kullback and R. A. Leibler, “On information and sufficiency,”The annals of mathematical statistics, vol. 22, no. 1, pp. 79–86, 1951

1951
[45]

The jensen-shannon divergence,

M. Men ´endez, J. Pardo, L. Pardo, and M. Pardo, “The jensen-shannon divergence,”Journal of the Franklin Institute, vol. 334, no. 2, pp. 307– 318, 1997

1997
[46]

Synthetic data metrics,

“Synthetic data metrics,” 04 2024. Version 0.14.0

2024
[47]

On the generation and evaluation of tabular data using gans,

B. Brenninkmeijer, A. de Vries, E. Marchiori, and Y . Hille, “On the generation and evaluation of tabular data using gans,”PhD diss., Radboud University, 2019

2019
[48]

Feature selection based on mutual information with correlation coefficient,

H. Zhou, X. Wang, and R. Zhu, “Feature selection based on mutual information with correlation coefficient,”Applied intelligence, vol. 52, no. 5, pp. 5457–5474, 2022

2022
[49]

Tabsyndex: A universal metric for robust evaluation of synthetic tabular data,

V . S. Chundawat, A. K. Tarun, M. Mandal, M. Lahoti, and P. Narang, “Tabsyndex: A universal metric for robust evaluation of synthetic tabular data,”arXiv preprint arXiv:2207.05295, 2022

work page arXiv 2022
[50]

Using dynamic time warping to find pat- terns in time series,

D. J. Berndt and J. Clifford, “Using dynamic time warping to find pat- terns in time series,” inProceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, AAAIWS’94, p. 359–370, AAAI Press, 1994

1994
[51]

Tunnicliffe Wilson, “Time series analysis: Forecasting and control,5th edition, by george e

G. Tunnicliffe Wilson, “Time series analysis: Forecasting and control,5th edition, by george e. p. box, gwilym m. jenkins, gregory c. reinsel and greta m. ljung, 2015. published by john wiley and sons inc., hoboken, new jersey, pp. 712. isbn: 978-1-118-67502-1,”Journal of Time Series Analysis, vol. 37, pp. n/a–n/a, 03 2016

2015
[52]

Data Synthesis based on Generative Adversarial Networks

N. Park, M. Mohammadi, K. Gorde, S. Jajodia, H. Park, and Y . Kim, “Data synthesis based on generative adversarial networks,”arXiv preprint arXiv:1806.03384, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[53]

UCI Machine Learning Repository: Breast Cancer Wisconsin (Diagnostic) Data Set

D. Dua and C. Graff, “UCI Machine Learning Repository: Breast Cancer Wisconsin (Diagnostic) Data Set.” UCI Machine Learning Repository,
[54]

University of California, Irvine, School of Information and Computer Sciences
[55]

Incidence of diagnosed diabetes in adults — united states, 1980–2014,

N. R. Burrows, I. Hora, L. S. Geiss, E. W. Gregg, and A. Albright, “Incidence of diagnosed diabetes in adults — united states, 1980–2014,” MMWR Morbidity and Mortality Weekly Report, vol. 66, no. 12, pp. 306– 309, 2017

1980
[56]

Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone,

D. Chicco and G. Jurman, “Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone,”BMC Medical Informatics and Decision Making, vol. 20, no. 16, 2020

2020
[57]

Generating production rules from decision trees,

J. R. Quinlan, “Generating production rules from decision trees,” in Proceedings of the 10th International Joint Conference on Artificial Intelligence (IJCAI), pp. 304–307, 1987

1987
[58]

Bupa liver disorders dataset

B. M. R. Ltd., “Bupa liver disorders dataset.” https://archive.ics.uci.edu/ ml/datasets/Liver+Disorders, 1990. UCI Machine Learning Repository

1990
[59]

Uci machine learning repository

D. Dua and C. Graff, “Uci machine learning repository.” https://archive. ics.uci.edu/ml/datasets/Lung+Cancer, 2019. Lung Cancer Dataset

2019
[60]

Dataset for estimation of obesity levels based on eating habits and physical condition in individuals from colombia, peru and mexico,

F. M. Palechor and A. de la Hoz Manotas, “Dataset for estimation of obesity levels based on eating habits and physical condition in individuals from colombia, peru and mexico,”Data in Brief, vol. 25, p. 104344, 2019

2019
[61]

Exploiting nonlinear recurrence and fractal scaling properties for voice disorder detection,

M. A. Little, P. E. McSharry, S. J. Roberts, D. A. E. Costello, and I. M. Moroz, “Exploiting nonlinear recurrence and fractal scaling properties for voice disorder detection,”BioMedical Engineering OnLine, vol. 6, no. 23, 2007

2007
[62]

Uci machine learning repository

D. Dua and C. Graff, “Uci machine learning repository.” https://archive. ics.uci.edu/ml, 2019. Adult Census Income Dataset

2019
[63]

PIRvision FoG presence detection

M. Emad-ud din, “PIRvision FoG presence detection.” UCI Machine Learning Repository, 2023

2023
[64]

Vietnam banking transaction dataset for fraud detection

H. T. Nguyen and T. N. Tran, “Vietnam banking transaction dataset for fraud detection.” Public financial transaction dataset, 2020. If sourced from Kaggle or Zenodo, include DOI here

2020
[65]

Democra- tizing tabular data access with an open-source synthetic-data sdk,

I. Krchova, M. V . Vieyra, M. Scriminaci, and A. Sidorenko, “Democra- tizing tabular data access with an open-source synthetic-data sdk,” 2025

2025
[66]

Practical Synthetic Data Generation: Balancing Privacy and the Broad Availability of Data

K. El Emam and L. Mosquera, Practical Synthetic Data Generation: Balancing Privacy and the Broad Availability of Data. O’Reilly Media, 2020

2020
[67]

General and specific utility measures for synthetic data,

J. Snoke, G. M. Raab, B. Nowok, C. Dibben, and A. Slavkovic, “General and specific utility measures for synthetic data,” Journal of the Royal Statistical Society: Series A, vol. 181, no. 3, pp. 663–688, 2018

2018
[68]

The DCR delusion: Measuring the privacy risk of synthetic data,

Z. Yao, N. Krčo, G. Ganev, and Y.-A. de Montjoye, “The DCR delusion: Measuring the privacy risk of synthetic data,” in Computer Security – ESORICS 2025 (V. Nicomette, A. Benzekri, N. Boulahia-Cuppens, and J. Vaidya, eds.), (Cham), pp. 469–487, Springer Nature Switzerland, 2026

2025
[69]

Practical Synthetic Data Generation: Balancing Privacy and the Broad Availability of Data

K. El Emam, L. Mosquera, and R. Hoptroff, Practical Synthetic Data Generation: Balancing Privacy and the Broad Availability of Data. Sebastopol, CA: O’Reilly Media, 1st ed., 2020

2020
[70]

Membership inference attacks against machine learning models,

R. Shokri, M. Stronati, C. Song, and V. Shmatikov, “Membership inference attacks against machine learning models,” 2017

2017
[71]

Evaluating differentially private machine learning in practice,

B. Jayaraman and D. Evans, “Evaluating differentially private machine learning in practice,” 2019

2019
[72]

P. J. Huber and E. M. Ronchetti, Robust Statistics. Wiley, 2nd ed., 2009

2009
[73]

The Algorithmic Foundations of Differential Privacy

C. Dwork and A. Roth, The Algorithmic Foundations of Differential Privacy, vol. 9 of Foundations and Trends in Theoretical Computer Science. Now Publishers Inc., 2014

2014
[74]

Reinforced Augmented Lagrangian for constrained optimization in deep learning,

H. Yuan, X. Lian, J. Li, J. Liu, and B. Xu, “Reinforced Augmented Lagrangian for constrained optimization in deep learning,” arXiv preprint arXiv:2106.01134, 2021

2021

APPENDIX

Table IX provides an exhaustive, multi-page breakdown of downstream predictive efficacy across all classifiers and sampling strategies. This summary highlights the Train-on-Synthe...
