pith. machine review for the scientific record.

arxiv: 2512.16284 · v2 · submitted 2025-12-18 · 💻 cs.CR

Recognition: no theorem link

Empirical Evaluation of Structured Synthetic Data Privacy Metrics: Novel experimental framework

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 21:40 UTC · model grok-4.3

classification 💻 cs.CR
keywords synthetic data · privacy metrics · tabular data · privacy quantification · empirical evaluation · risk insertion · no-box threat model · privacy enhancing technology

The pith

A framework evaluates tabular synthetic data privacy metrics by deliberately inserting known risks and measuring detection under no-box conditions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes an experimental framework that inserts specific, controlled privacy risks into synthetic tabular datasets and then applies existing privacy quantification methods to see if those risks are detected. This matters for practitioners because synthetic data is widely promoted as a privacy-enhancing technology, yet without reliable empirical tests it is hard to know which quantification methods actually work or how to compare them. The authors survey current approaches and related legal theory, then demonstrate the framework on public datasets using no-box threat models where the attacker has no access to the original data. If the framework holds, it gives a repeatable way to benchmark privacy claims rather than relying on theoretical assurances alone.

Core claim

The authors claim that deliberately inserting specific privacy risks into synthetic tabular data, then testing whether quantification methods detect them, provides an empirical way to assess the efficacy of those methods under no-box threat models on publicly available datasets.

What carries the argument

Controlled deliberate risk insertion into synthetic tabular data to test privacy quantification methods.
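The mechanism is easy to state concretely. As a minimal sketch (not the paper's implementation; the risk model, metric, and data here are illustrative stand-ins), insert a "leaky" risk by copying training records verbatim into the synthetic table, then check that a distance-to-closest-record (DCR) style metric reports the increased risk:

```python
import numpy as np

def insert_leaky_risk(train, synthetic, leak_fraction, rng):
    """Replace a fraction of synthetic rows with verbatim training rows
    (a 'leaky' risk: direct copies are the clearest privacy failure)."""
    n_leak = int(leak_fraction * len(synthetic))
    out = synthetic.copy()
    idx = rng.choice(len(train), size=n_leak, replace=False)
    out[:n_leak] = train[idx]
    return out

def dcr_score(train, synthetic):
    """Distance to closest record: for each synthetic row, the minimum
    Euclidean distance to any training row. Lower median = higher risk."""
    d = np.linalg.norm(synthetic[:, None, :] - train[None, :, :], axis=-1)
    return np.median(d.min(axis=1))

rng = np.random.default_rng(0)
train = rng.normal(size=(200, 5))
clean = rng.normal(size=(200, 5))   # stand-in for a generator's output
leaky = insert_leaky_risk(train, clean, 0.5, rng)

# A working metric should report a lower (riskier) DCR after insertion.
assert dcr_score(train, leaky) < dcr_score(train, clean)
```

Because the evaluator controls the insertion, the expected direction of every metric's response is known in advance, which is what turns the comparison into a benchmark.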

If this is right

  • Privacy metrics can be ranked by their ability to detect deliberately inserted risks across multiple datasets.
  • The framework supplies a repeatable benchmark for comparing synthetic data generators on privacy grounds.
  • Legal compliance arguments for synthetic data can be grounded in empirical detection rates rather than theoretical properties alone.
  • No-box threat models become testable rather than assumed, allowing direct measurement of residual identification risk.
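The ranking claim in the first bullet can be made operational: score each metric by how well its output tracks the inserted risk level. A hedged sketch, with invented metric scores and a hand-rolled rank correlation standing in for a real evaluation run:

```python
import numpy as np

def spearman(x, y):
    """Rank correlation between inserted risk levels and metric scores
    (no ties assumed, so double-argsort gives the ranks directly)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))

# Hypothetical scores from two metrics across increasing inserted-risk levels.
risk_levels = np.array([0.0, 0.1, 0.2, 0.4, 0.8])
metric_a = np.array([0.02, 0.11, 0.19, 0.42, 0.77])  # tracks the risk
metric_b = np.array([0.30, 0.28, 0.33, 0.29, 0.31])  # insensitive to it

scores = {"A": spearman(risk_levels, metric_a),
          "B": spearman(risk_levels, metric_b)}
ranked = sorted(scores, key=scores.get, reverse=True)
assert ranked[0] == "A"   # the risk-tracking metric ranks first
```

Repeating this across datasets and risk models gives the per-metric comparison the bullets anticipate.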

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same insertion approach could be adapted to test privacy metrics on non-tabular data such as time series or graphs.
  • If insertion patterns match known real-world breach vectors, regulators could adopt the framework as a certification test for privacy-enhancing technologies.
  • Extending the method to measure utility loss alongside risk detection would reveal trade-offs that current single-metric evaluations miss.

Load-bearing premise

Deliberately inserting specific risks into synthetic data accurately models real-world privacy threats under no-box conditions.

What would settle it

Run the framework on a dataset where the inserted risks are known to be undetectable by any metric, or on one where real-world attacks succeed despite the metrics reporting low risk; a mismatch between inserted risks and metric scores would falsify the framework's validity.

Figures

Figures reproduced from arXiv: 2512.16284 by Alexander Boudewijn, Andrea Filippo Ferraris, Daniele Panfilo, Diana Sofronieva, Filiberto Brozzetti, Giuseppe D'Acquisto, Luca Bortolussi, Milton Nicolás Plasencia Palacios, Sebastiano Saccani.

Figure 1: Overview of synthetic data privacy quantification methods
Figure 2: Risk assessment methods evaluated using the leaky risk model
Figure 3: Risk assessment methods evaluated using the overfit risk model - RTF [61]
Figure 4: Risk assessment methods evaluated using the overfit risk model - Synthpop [62]
Figure 5: Risk assessment methods evaluated using the DP risk model - PATEGAN [39]
Figure 6: Risk assessment methods evaluated using the DP risk model - AIM [63]
Figure 7: Correlation matrices of risk assessment methods using the leaky risk model (from left to right: Adult, Texas, Census)
Figure 8: Correlation matrices of risk assessment methods using the overfitting risk model - RTF (from left to right: Adult, …)
Figure 9: Correlation matrices of risk assessment methods using the overfitting risk model - Synthpop (from left to right: Adult, …)
Figure 10: Correlation matrices of risk assessment methods using the DP risk model - AIM (from left to right: Adult, Texas, …)
Figure 11: Correlation matrices of risk assessment methods using the DP risk model - PATEGAN (from left to right: Adult, …)
Figure 12: Results with k-NN-based privacy metric for various k using the leaky risk model
Figure 13: Results with k-NN-based privacy metric for various k using the overfit risk model - RTF [61]
Figure 14: Results with k-NN-based privacy metric for various k using the overfit risk model - Synthpop [62]
(Caption spillover from Appendix C, experiments with outlier removal: the local outlier factor (LOF) was applied to embeddings obtained through contrastive learning to identify outliers [75]; different proportions of outliers were removed, under the overfitting risk model, to evaluate their impact on the measurements.)
Figure 15: Results with k-NN-based privacy metric for various k using the DP risk model - PATEGAN [39]
Figure 16: Results with k-NN-based privacy metric for various k using the DP risk model - AIM
Figure 17: IMS, DCR, and MIA with outlier removal in the original dataset prior to generator training, with both no…
Figure 18: GTCAP and ML Inference with outlier removal in the original dataset prior to generator training, with both…
Figure 19: Anonymeter's methods with outlier removal in the original dataset prior to generator training, with both no…
Figure 20: IMS, DCR, and MIA with outlier removal in the original dataset prior to generator training, with both no…
Figure 21: GTCAP and ML Inference with outlier removal in the original dataset prior to generator training, with both…
Figure 22: Anonymeter's methods with outlier removal in the original dataset prior to generator training, with both no…
Figure 23: Comparison between Canary Record Baseline and Risk on the training set using the leaky risk model
Figure 24: Results with GTCAP privacy metric for various radii using the overfit risk model
Figure 25: MLE utility scores of synthetic datasets per dataset and generator (utility of training set for reference)
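The outlier-removal experiments behind Figures 17-22 preprocess the original dataset before generator training; per the appendix, the paper applies the local outlier factor (LOF) to contrastive-learning embeddings [75]. As a rough numpy-only stand-in (a kNN-distance filter rather than true LOF, applied to raw features rather than learned embeddings), the shape of such a step is:

```python
import numpy as np

def knn_outlier_filter(X, k=10, quantile=0.95):
    """Flag records whose mean distance to their k nearest neighbours lies
    in the top tail; a crude proxy for LOF-style density-based removal."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d.sort(axis=1)
    knn_mean = d[:, 1:k + 1].mean(axis=1)   # column 0 is the self-distance
    keep = knn_mean <= np.quantile(knn_mean, quantile)
    return X[keep], keep

rng = np.random.default_rng(0)
inliers = rng.normal(0, 1, size=(300, 4))
outliers = rng.normal(8, 1, size=(10, 4))   # planted far-away records
X = np.vstack([inliers, outliers])

X_clean, keep = knn_outlier_filter(X)
assert not keep[-10:].any()       # all planted outliers removed
assert keep[:300].mean() > 0.9    # most genuine records survive
```

Varying `quantile` plays the role of the "different proportions of outliers removed" in the appendix experiments.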
Original abstract

Synthetic data generation is gaining traction as a privacy enhancing technology (PET). When properly generated, synthetic data preserve the analytic utility of real data while avoiding the retention of information that would allow the identification of specific individuals. However, the concept of data privacy remains elusive, making it challenging for practitioners to evaluate and benchmark the degree of privacy protection offered by synthetic data. In this paper, we propose a framework to empirically assess the efficacy of tabular synthetic data privacy quantification methods through controlled, deliberate risk insertion. To demonstrate this framework, we survey existing approaches to synthetic data privacy quantification and the related legal theory. We then apply the framework to the main privacy quantification methods with no-box threat models on publicly available datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to introduce an empirical framework for assessing the efficacy of tabular synthetic data privacy quantification methods by using controlled, deliberate risk insertion to create ground-truth labels for benchmarking under no-box threat models. It surveys existing privacy quantification approaches and legal theory, then applies the framework to main methods on public datasets.

Significance. If the framework is shown to be valid, it would be significant for the field by providing a practical method to empirically evaluate and compare privacy metrics for synthetic data, helping to address the elusive nature of privacy assessment in privacy-enhancing technologies.

major comments (2)
  1. [Proposed framework] The framework's reliance on deliberate risk insertion to simulate no-box attacker capabilities is a load-bearing assumption that lacks justification. Inserting risks requires knowledge of original records or generator internals, which is unavailable in a true no-box setting where only the synthetic table is provided; this may cause the inserted signals to be either undetectable by realistic attacks or detectable only due to procedural artifacts, undermining the ground-truth mapping.
  2. [Abstract] Although the abstract states that the framework is applied to main privacy quantification methods, no concrete results, validation steps, or error analysis are provided, leaving the central empirical claim without demonstrated support.
minor comments (1)
  1. [Survey of approaches] The survey of existing approaches to synthetic data privacy quantification and legal theory would benefit from clearer integration with how they inform the design of the proposed framework.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below, clarifying the framework design and manuscript content while committing to targeted revisions for improved justification and clarity.

read point-by-point responses
  1. Referee: [Proposed framework] The framework's reliance on deliberate risk insertion to simulate no-box attacker capabilities is a load-bearing assumption that lacks justification. Inserting risks requires knowledge of original records or generator internals, which is unavailable in a true no-box setting where only the synthetic table is provided; this may cause the inserted signals to be either undetectable by realistic attacks or detectable only due to procedural artifacts, undermining the ground-truth mapping.

    Authors: The framework deliberately separates the experimental setup (controlled risk insertion by the evaluator to establish ground-truth labels) from the threat model under which the metrics are tested (strictly no-box, with metrics receiving only the synthetic table). This is analogous to standard benchmarking practices in privacy research, such as membership inference evaluations that use known labels for ground truth while testing black-box attacks. We acknowledge that the manuscript would benefit from explicit discussion of this distinction, potential procedural artifacts, and limitations in simulating realistic no-box conditions. We will revise the methodology section to provide this justification and analysis. revision: yes

  2. Referee: [Abstract] Although the abstract states that the framework is applied to main privacy quantification methods, no concrete results, validation steps, or error analysis are provided, leaving the central empirical claim without demonstrated support.

    Authors: The abstract is intentionally concise and summarizes the application at a high level. Concrete results—including quantitative comparisons of privacy metrics, validation across public datasets, and error analysis—are detailed in the experimental sections of the full manuscript. To improve accessibility, we will revise the abstract to include a brief mention of key empirical outcomes and validation approach. revision: yes
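The benchmarking analogy in the first response can be made concrete: a membership-inference evaluation scores an attack that sees only the released data, while the evaluator's known membership labels supply ground truth. A minimal sketch under invented data, with a nearest-neighbor attack standing in for a real one (not the paper's setup):

```python
import numpy as np

def nn_distance_attack(candidates, synthetic):
    """No-box membership score: negative distance from each candidate record
    to its nearest synthetic record (closer = more likely a training member)."""
    d = np.linalg.norm(candidates[:, None, :] - synthetic[None, :, :], axis=-1)
    return -d.min(axis=1)

def auc(scores, labels):
    """Probability that a random member outscores a random non-member."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    return float((pos[:, None] > neg[None, :]).mean())

rng = np.random.default_rng(1)
members = rng.normal(size=(100, 3))
nonmembers = rng.normal(size=(100, 3))
# A leaky "generator": synthetic data that memorises members up to small noise.
synthetic = members + rng.normal(scale=0.05, size=members.shape)

candidates = np.vstack([members, nonmembers])
labels = np.r_[np.ones(100), np.zeros(100)]
scores = nn_distance_attack(candidates, synthetic)

# Ground-truth labels are known to the evaluator, never to the attack.
assert auc(scores, labels) > 0.9
```

The attack touches only `synthetic`; the labels enter only at scoring time, which is exactly the separation the rebuttal appeals to.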

Circularity Check

0 steps flagged

No circularity: empirical framework applies existing methods without self-referential reduction

full rationale

The paper proposes an empirical evaluation framework that inserts controlled privacy risks into synthetic tabular data and then applies existing no-box privacy quantification methods to public datasets. No equations, derivations, fitted parameters, or uniqueness theorems are present. The central claim rests on the methodological design of risk insertion and benchmarking rather than any reduction to self-citations, self-definitions, or renamed known results. The approach is self-contained as an application of prior independent quantification techniques, with no load-bearing step that collapses to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on domain assumptions about threat models and the validity of risk insertion as a proxy for privacy evaluation.

axioms (2)
  • domain assumption No-box threat models are appropriate for assessing synthetic data privacy risks
    The paper explicitly applies the framework under no-box threat models.
  • domain assumption Deliberate insertion of known risks can serve as a valid ground truth for metric evaluation
    Central to the proposed empirical assessment method.

pith-pipeline@v0.9.0 · 5452 in / 1092 out tokens · 27838 ms · 2026-05-16T21:40:26.121341+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · 1 internal anchor

  1. Mauro Giuffrè and Dennis Shung. Harnessing the power of synthetic data in healthcare: innovation, application, and privacy. npj Digital Medicine, 6, 2023.
  2. Andre Goncalves, Priyadip Ray, Braden Soper, Jennifer Stevens, Linda Coyle, and Ana Paula Sales. Generation and evaluation of synthetic patient data. BMC Medical Research Methodology, 2020.
  3. Vamsi K. Potluru, Daniel Borrajo, Andrea Coletta, Niccolò Dalmasso, Yousef El-Laham, Elizabeth Fons, Mohsen Ghassemi, Sriram Gopalakrishnan, Vikesh Gosai, Eleonora Kreačić, Ganapathy Mani, Saheed Obitayo, Deepak Paramanand, Natraj Raman, Mikhail Solonin, Srijan Sood, Svitlana Vyetrenko, Haibei Zhu, Manuela Veloso, and Tucker Balch. Synthetic data applic…
  4. Michal S. Gal and Orla Lynskey. Synthetic data: Legal implications of the data-generation revolution. Iowa Law Review, 109(1087):1087–1154, 2024.
  5. Jerome P. Reiter. Synthetic data: A look back and a look forward. Trans. Data Priv., 16(1):15–24, 2023.
  6. Ana Beduschi. Synthetic data protection: Towards a paradigm change in data regulation? Big Data & Society, 11(1):20539517241231277, 2024.
  7. Steven M. Bellovin, Preetam K. Dutta, and N. Reitinger. Privacy and synthetic datasets. Stanford Technology Law Review, 2018.
  8. Gillian M. Raab, Beata Nowok, and Chris Dibben. Practical privacy metrics for synthetic data, 2024.
  9. Alexander Boudewijn, Andrea F. Ferraris, Daniele Panfilo, Vanessa Cocca, Sabrina Zinutti, Karel Schepper, and Carlo Chauvenet. Privacy measurement in tabular synthetic data: State of the art and future research directions, 2023.
  10. Matteo Giomi, Franziska Boenisch, Christoph Wehmeyer, and Borbála Tasnádi. A unified framework for quantifying privacy risk in synthetic data, 2022.
  11. A. Gonzales, G. Guruswamy, and S. R. Smith. Synthetic data in health care: A narrative review. PLOS Digital Health, 2(1):e0000082, 2023.
  12. M. Giuffrè and D. L. Shung. Harnessing the power of synthetic data in healthcare: Innovation, application, and privacy. NPJ Digital Medicine, 6(1):186, 2023.
  13. James Jordon, Lukasz Szpruch, Florimond Houssiau, Mirko Bottarelli, Giovanni Cherubin, Carsten Maple, Samuel N. Cohen, and Adrian Weller. Synthetic data - what, why and how?, 2022.
  14. J. Hradec, M. Craglia, M. Di Leo, S. De Nigris, N. Ostlaender, and N. Nicholson. Multipurpose synthetic population for policy applications. JRC technical report, 2022.
  15. Morgan Guillaudeux, Olicia Rousseau, Julien Petot, Zineb Bennis, Charles-Axel Dein, Thomas Goronflot, Nicolas Vince, Sophie Limou, Matilde Karakachoff, Matthieu Wargny, and Pierre-Antoine Gourraud. Patient-centric synthetic data generation, no reason to risk reidentification in biomedical data analysis. NPJ Digital Medicine, 6(37), 2023.
  16. Jean-Francois Rajotte, Robert Bergen, David L. Buckeridge, and Khaled El Emam. Synthetic data as an enabler for machine learning applications in medicine. iScience, 25(11):105331, 2022.
  17. A. Boudewijn and A. F. Ferraris. Legal and regulatory perspectives on synthetic data as an anonymization strategy. Journal of Personal Data Protection Law, 1:17–31, 2024.
  18. James Jordon, Lukasz Szpruch, Florimond Houssiau, Mirko Bottarelli, Giovanni Cherubin, Carsten Maple, Samuel N. Cohen, and Adrian Weller. Synthetic data - what, why and how?, 2022.
  19. Alvaro Figueira and Bruno Vaz. Survey on synthetic data generation, evaluation methods and GANs. Mathematics, 10(15), 2022.
  20. Abdul Jabbar, Xi Li, and Bourahla Omar. A survey on generative adversarial networks: Variants, applications, and training, 2020.
  21. Article 29 Working Party. Opinion 4/2007 on the concept of personal data, 2007.
  22. Article 29 Working Party on the Protection of Individuals with regard to the Processing of Personal Data. Opinion 05/2014 on anonymisation techniques, 2014.
  23. Cesar A. F. López and Abdullah Elbi. On the legal nature of synthetic data. Proceedings of the Thirty-sixth Conference on Neural Information Processing Systems (NeurIPS), 2022.
  24. Jingchen Hu and Claire McKay Bowen. Advancing microdata privacy protection: A review of synthetic data, 2023.
  25. Ram Shankar Siva Kumar, David O'Brien, Kendra Albert, Salomé Viljöen, and Jeffrey Snover. Failure modes in machine learning systems, 2019.
  26. Theresa Stadler, Bristena Oprisanu, and Carmela Troncoso. Synthetic data - anonymisation groundhog day, 2022.
  27. Lisa Pilgram, Fida K. Dankar, Jorg Drechsler, Mark Elliot, Josep Domingo-Ferrer, Paul Francis, Murat Kantarcioglu, Linglong Kong, Bradley Malin, Krishnamurty Muralidhar, Puja Myles, Fabian Prasser, Jean Louis Raisaro, Chao Yan, and Khaled El Emam. A consensus privacy metrics framework for synthetic data, 2025.
  28. Bayrem Kaabachi, Jérémie Despraz, Thierry Meurers, Karen Otte, Mehmed Halilovic, Fabian Prasser, and Jean Louis Raisaro. Can we trust synthetic data in medicine? A scoping review of privacy and utility metrics, 2023.
  29. Pablo A. Osorio-Marulanda, Gorka Epelde, Mikel Hernandez, Imanol Isasa, Nicolas Moreno Reyes, and Andoni Beristain Iraola. Privacy mechanisms and evaluation metrics for synthetic data generation: A systematic review. IEEE Access, 2024.
  30. Mikel Hernandez, Gorka Epelde, Ane Alberdi, Rodrigo Cilla, and Debbie Rankin. Standardised metrics and methods for synthetic tabular data evaluation, 2021.
  31. Chao Yan, Yao Yan, Zhiyu Wan, Ziqi Zhang, Larsson Omberg, Justin Guinney, Sean Mooney, and Bradley Malin. A multifaceted benchmarking of synthetic electronic health record generation models. Nature Communications, 13, 2022.
  32. Dario Brunelli, Shalini Kurapati, and Luca Gilli. SURE: A new privacy and utility assessment library for synthetic data. In 2024 IEEE International Conference on Blockchain (Blockchain), pages 643–648, 2024.
  33. Boris van Breugel, Hao Sun, Zhaozhi Qian, and Mihaela van der Schaar. Membership inference attacks against synthetic data through overfitting detection, 2023.
  34. Steven Golob. Privacy Vulnerabilities in Marginals-based Synthetic Data. PhD thesis, 2024.
  35. Matthieu Meeus, Florent Guepin, Ana-Maria Cretu, and Yves-Alexandre de Montjoye. Achilles' heels: Vulnerable record identification in synthetic data publishing, 2023.
  36. Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014.
  37. Lucas Rosenblatt, Xiaoyan Liu, Samira Pouyanfar, Eduardo de Leon, Anuj Desai, and Joshua Allen. Differentially private synthetic data: Applied evaluations and enhancements, 2020.
  38. Liyang Xie, Kaixiang Lin, Shu Wang, Fei Wang, and Jiayu Zhou. Differentially private generative adversarial network, 2018.
  39. Jinsung Yoon, James Jordon, and Mihaela van der Schaar. PATE-GAN: Generating synthetic data with differential privacy guarantees. In International Conference on Learning Representations, 2019.
  40. Robin C. Geyer, Tassilo Klein, and Moin Nabi. Differentially private federated learning: A client level perspective, 2018.
  41. Christine Task, Karan Bhagat, and Gary Howarth. SDNist v2: Deidentified Data Report Tool, March 2023.
  42. Georgi Ganev and Emiliano De Cristofaro. On the inadequacy of similarity-based privacy metrics: Reconstruction attacks against "truly anonymous synthetic data". arXiv preprint arXiv:2312.05114, 2023.
  43. Gillian M. Raab. Utility and disclosure risk for differentially private synthetic categorical data. In Josep Domingo-Ferrer and Maryline Laurent, editors, Privacy in Statistical Databases, pages 250–265, Cham, 2022. Springer International Publishing.
  44. Sina Rashidian, Fusheng Wang, Richard Moffitt, Victor Garcia, Anurag Dutt, Wei Chang, Vishwam Pandya, Janos Hajagos, Mary Saltz, and Joel Saltz. Smooth-GAN: Towards sharp and smooth synthetic EHR data generation. In Martin Michalowski and Robert Moskovitch, editors, Artificial Intelligence in Medicine, pages 37–48, Cham. Springer International Publishing.
  45. Cristóbal Esteban, Stephanie L. Hyland, and Gunnar Rätsch. Real-valued (medical) time series generation with recurrent conditional GANs, 2017.
  46. Daniele Panfilo, Alexander Boudewijn, Sebastiano Saccani, Andrea Coser, Borut Svara, Carlo Rossi Chauvenet, Ciro Antonio Mami, and Eric Medvet. A deep learning-based pipeline for the generation of synthetic tabular data. IEEE Access, pages 1–1, 2023.
  47. Michael Platzer and Thomas Reutterer. Holdout-based empirical assessment of mixed-type synthetic data. Frontiers in Big Data, 2021.
  48. Sobhan Moazemi, Tim Adams, Hwei Geok Ng, Lisa Kühnel, Julian Schneider, Anatol-Fiete Näher, Juliane Fluck, and Holger Fröhlich. NFDI4Health workflow and service for synthetic data generation, assessment and risk management, 2024.
  49. Edward Choi, Siddharth Biswal, Bradley Malin, Jon Duke, Walter F. Stewart, and Jimeng Sun. Generating multi-label discrete patient records using generative adversarial networks, 2018.
  50. G. D'Acquisto and M. Naldi. Big Data e Privacy by design. Anonimizzazione, pseudonimizzazione, sicurezza. Giappichelli, 2017.
  51. Jamie Hayes, Luca Melis, George Danezis, and Emiliano De Cristofaro. LOGAN: Membership inference attacks against generative models, 2018.
  52. Dingfan Chen, Ning Yu, Yang Zhang, and Mario Fritz. GAN-Leaks: A taxonomy of membership inference attacks against generative models. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, pages 343–362, 2020.
  53. Benjamin Hilprecht, Martin Härterich, and Daniel Bernau. Monte Carlo and reconstruction membership inference attacks against generative models. Proceedings on Privacy Enhancing Technologies, 2019.
  54. Florimond Houssiau, James Jordon, Samuel N. Cohen, Owen Daniel, Andrew Elliott, James Geddes, Callum Mole, Camila Rangel-Smith, and Lukasz Szpruch. TAPAS: A toolbox for adversarial privacy auditing of synthetic data, 2022.
  55. Rémy Chapelle and Bruno Falissard. Statistical properties and privacy guarantees of an original distance-based fully synthetic data generation method, 2023.
  56. Mark Elliot. Creating the best risk-utility profile: The synthetic data challenge, 2015.
  57. Yingrui Chen, Jennifer Taub, and Mark Elliot. The trade-off between information utility and disclosure risk in a GA synthetic data generator. Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality, 2019.
  58. Nicholas Carlini, Matthew Jagielski, Chiyuan Zhang, Nicolas Papernot, Andreas Terzis, and Florian Tramer. The privacy onion effect: Memorization is relative, 2022.
  59. Hailong Hu and Jun Pang. Membership inference attacks against GANs by leveraging over-representation regions. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, CCS '21, pages 2387–2389, New York, NY, USA, 2021. Association for Computing Machinery.
  60. Meenatchi Sundaram Muthu Selva Annamalai, Andrea Gadotti, and Luc Rocher. A linear reconstruction approach for attribute inference attacks against synthetic data, 2024.
  61. Aivin V. Solatorio and Olivier Dupriez. REaLTabFormer: Generating realistic relational and tabular data using transformers, 2023.
  62. Beata Nowok, Gillian M. Raab, and Chris Dibben. synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74(11):1–26, 2016.
  63. Open Differential Privacy (OpenDP). The SmartNoise system: AIM, 2024. https://github.com/opendp/smartnoise-sdk
  64. Nicolas Papernot, Martín Abadi, Ulfar Erlingsson, Ian Goodfellow, and Kunal Talwar. Semi-supervised knowledge transfer for deep learning from private training data. arXiv preprint arXiv:1610.05755, 2016.
  65. Dario Brunelli, Shalini Kurapati, and Luca Gilli. SURE: A new privacy and utility assessment library for synthetic data. In 2024 IEEE International Conference on Blockchain (Blockchain), pages 643–648, 2024.
  66. Zhaozhi Qian, Bogdan-Constantin Cebere, and Mihaela van der Schaar. Synthcity: Facilitating innovative use cases of synthetic data in different data modalities, 2023.
  67. Jennifer Taub, Mark James Elliot, and Gillian M. Raab. Creating the best risk-utility profile: The synthetic data challenge, 2019.
  68. Nicola De Cao, Wilker Aziz, and Ivan Titov. Block neural autoregressive flow. In Uncertainty in Artificial Intelligence, pages 1263–1273. PMLR, 2020.
  69. Barry Becker and Ronny Kohavi. Adult. UCI Machine Learning Repository, 1996. DOI: https://doi.org/10.24432/C5XW20
  70. Center for Health Statistics, Texas Department of State Health Services. Texas hospital discharge data public use data, 2005. https://www.dshs.texas.gov/texas-health-care-information-collection/general-public-information/hospital-discharge-data-public
  71. Steven Ruggles, Sarah Flood, Ronald Goeken, Josiah Grover, Erin Meyer, Jose Pacas, and Matthew Sobek. IPUMS USA: Version 8.0 extract of 1940 Census for U.S. Census Bureau disclosure avoidance research. Minneapolis, MN: IPUMS, 2008. DOI: https://doi.org/10.18128/D010.V8.0.EXT1940USCB
  72. Nicholas Carlini, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. The secret sharer: Evaluating and testing unintended memorization in neural networks, 2019.
  73. Zexi Yao, Nataša Krčo, Georgi Ganev, and Yves-Alexandre de Montjoye. The DCR delusion: Measuring the privacy risk of synthetic data. In European Symposium on Research in Computer Security, pages 469–487. Springer, 2025.
  74. Giuseppe D'Acquisto, Aloni Cohen, Maurizio Naldi, and Kobbi Nissim. From isolation to identification. In Privacy in Statistical Databases: International Conference, PSD 2024, Antibes Juan-les-Pins, France, September 25–27, 2024, Proceedings, pages 3–17, Berlin, Heidelberg, 2024. Springer-Verlag.
  75. Milton Nicolás Plasencia Palacios, Sebastiano Saccani, Gabriele Sgroi, Alexander Boudewijn, and Luca Bortolussi. Contrastive learning-based privacy metrics in tabular synthetic datasets, 2025.
    Milton Nicolás Plasencia Palacios, Sebastiano Saccani, Gabriele Sgroi, Alexander Boudewijn, and Luca Bortolussi. Contrastive learning-based privacy metrics in tabular synthetic datasets, 2025. 22 APREPRINT- DECEMBER19, 2025 A Complete experimental results Figure 2: Risk assessment methods evaluated using the leaky risk model Figure 3: Risk assessment meth...