pith. machine review for the scientific record.

arxiv: 2605.06835 · v1 · submitted 2026-05-07 · 💻 cs.LG · cs.AI

Recognition: 2 Lean theorem links

On Privacy Leakage in Tabular Diffusion Models: Influential Factors, Attacker Knowledge, and Metrics

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 00:47 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI
keywords privacy leakage · tabular diffusion models · membership inference attacks · synthetic tabular data · attacker knowledge · black-box setting · white-box setting · heuristic privacy metrics

The pith

Tabular diffusion models leak membership information even when attackers lack full knowledge of the training setup or data distributions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines privacy risks when using diffusion models to generate synthetic tabular data. It applies membership inference attacks in black-box and white-box settings to measure how training configurations, synthesis decisions, and attacker capabilities affect leakage. The central result is that successful attacks do not require perfect knowledge of the training process, matching data distributions, or large compute resources. This finding matters for sensitive domains that rely on synthetic tabular proxies to limit exposure of real records. The work also shows that simple heuristic checks, such as distance to the closest training record, can give misleading impressions of privacy.

Core claim

Leveraging state-of-the-art membership inference attacks for tabular diffusion models in both black- and white-box settings, this work quantifies the impact of training setup, synthesis choices, and attacker knowledge on privacy leakage. The results demonstrate that adversaries need not have perfect knowledge of the training setup, identical data distributions, or massive compute resources to construct successful attacks. The pitfalls associated with applying heuristic privacy metrics, such as distance-to-closest record, are also revealed.

What carries the argument

Membership inference attacks applied to tabular diffusion models in black- and white-box settings, used to quantify how much training-record membership information leaks through generated samples.
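To make the black-box end of this machinery concrete, here is a minimal distance-based membership attack in the spirit of nearest-neighbor attacks on generative models (GAN-Leaks style). This is an illustrative sketch on toy Gaussian data, not the paper's exact attack; the "leaky" generator that jitters its training set is a hypothetical stand-in for an overfit tabular diffusion model.

```python
import numpy as np

def nn_attack_scores(candidates, synthetic):
    """Black-box MIA score: negated distance from each candidate record to its
    nearest synthetic record. Higher score => guess 'member'."""
    diffs = candidates[:, None, :] - synthetic[None, :, :]   # pairwise differences
    dists = np.sqrt((diffs ** 2).sum(axis=-1))               # Euclidean distances
    return -dists.min(axis=1)

def attack_auc(scores_members, scores_nonmembers):
    """AUC of the attack: probability a random member outscores a random non-member."""
    wins = (scores_members[:, None] > scores_nonmembers[None, :]).mean()
    ties = (scores_members[:, None] == scores_nonmembers[None, :]).mean()
    return wins + 0.5 * ties

# Toy demo: a hypothetical 'generator' that memorizes and jitters its training set.
rng = np.random.default_rng(0)
train = rng.normal(size=(200, 5))                              # members
holdout = rng.normal(size=(200, 5))                            # non-members, same distribution
synthetic = train + rng.normal(scale=0.1, size=train.shape)    # leaky synthesis

auc = attack_auc(nn_attack_scores(train, synthetic),
                 nn_attack_scores(holdout, synthetic))
print(f"attack AUC: {auc:.3f}")
```

For this deliberately leaky generator the AUC lands well above the 0.5 chance level; the paper's experiments measure how training and synthesis choices move real attacks along this scale.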

If this is right

  • Training setup and synthesis parameter choices directly control the amount of membership information that leaks from tabular diffusion models.
  • Effective membership inference remains possible even when the attacker has only approximate knowledge of the training distribution and limited computational budget.
  • Heuristic privacy metrics such as nearest-record distance fail to capture actual leakage and can produce over-optimistic assessments.
  • Privacy evaluation of tabular diffusion models must account for realistic attacker knowledge levels rather than assuming worst-case or perfect information.
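The heuristic-metric pitfall in the bullets above can be illustrated with a toy sketch (hypothetical data, not the paper's experiments): a generator that copies a small fraction of its training records verbatim shows essentially the same median distance-to-closest-record (DCR) as a safe independent sampler, even though the copied rows are fully exposed.

```python
import numpy as np

def dcr(synthetic, train):
    """Distance to closest training record for each synthetic row."""
    d = np.sqrt(((synthetic[:, None, :] - train[None, :, :]) ** 2).sum(-1))
    return d.min(axis=1)

rng = np.random.default_rng(1)
train = rng.normal(size=(500, 5))

safe = rng.normal(size=(500, 5))    # independent sampler, no access to train
leaky = safe.copy()
leaky[:25] = train[:25]             # memorizes 5% of training records verbatim

print(f"median DCR safe : {np.median(dcr(safe, train)):.2f}")
print(f"median DCR leaky: {np.median(dcr(leaky, train)):.2f}")
```

The two medians are nearly identical because a population-level aggregate barely registers 25 exposed rows out of 500, which is the over-optimism the last two bullets warn about.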

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Organizations deploying synthetic tabular data generators may need to adopt attack-aware training procedures to reduce leakage under partial-knowledge scenarios.
  • The same leakage factors could appear in other generative models for tabular data, suggesting a need to test diffusion-specific results on alternative architectures.
  • Combining multiple attack types or incorporating auxiliary data sources could reveal higher leakage levels than single-attack evaluations show.

Load-bearing premise

State-of-the-art membership inference attacks provide reliable and generalizable indicators of real privacy leakage in tabular diffusion models.

What would settle it

An experiment in which records identified as members by the attacks show no higher reconstruction or exposure rates from model outputs than non-members in a deployed tabular diffusion system.

Figures

Figures reproduced from arXiv: 2605.06835 by Behnoosh Zamanlooy, D. B. Emerson, Elaheh Bassak, Fatemeh Tavakoli, Marcelo Lotif, Masoumeh Shafieinejad, Sara Kodeiri, Xi He.

Figure 1: Training and synthesis levers that influence TF MIA success and DCR for the Berka dataset.
Figure 2: Variations in attacker computing power (left), shadow model mismatch (middle) and …
Figure 3: Training and synthesis levers vs. Ensemble MIA success and DCR for the Berka dataset.
Figure 4: Variations in attacker computing power (left), shadow model mismatch (middle) and …
Figure 5: Training and synthesis levers that influence TF MIA success and DCR for the Diabetes …
Figure 6: Variations in attacker computing power (left), shadow model mismatch (middle) and …
Figure 7: Training and synthesis levers vs. Ensemble MIA success and DCR for the Diabetes dataset.
Figure 8: Variations in attacker computing power (left), shadow model mismatch (middle) and …
Figure 9: Ensemble attack ablations (top) and RMIA shadow model variations (bottom) for the Berka …
Figure 10: Training and synthesis levers that influence NNDR, HR, and EIR in comparison with TF …
Figure 11: Training and synthesis levers that influence NNDR, HR, and EIR in comparison with TF …
Original abstract

Tabular data plays an important role in many fields and industries, including those with elevated privacy considerations and risks. As such, there is a rising interest in generating high-quality synthetic proxies for real tabular data as a means of reducing privacy risk and proprietary data exposure. With tabular diffusion models (TDMs) demonstrating leading performance in synthesizing such data, understanding and measuring the privacy risks associated with these models is imperative. Leveraging state-of-the-art membership inference attacks for TDMs in both black- and white-box settings, this work quantifies the impact of training setup, synthesis choices, and attacker knowledge on privacy leakage. Moreover, the results demonstrate that adversaries need not have perfect knowledge of the training setup, identical data distributions, or massive compute resources to construct successful attacks. Finally, the pitfalls associated with applying heuristic privacy metrics, such as distance-to-closest record, are revealed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper investigates privacy leakage in tabular diffusion models (TDMs) by adapting state-of-the-art membership inference attacks (MIAs) to both black-box and white-box settings. It empirically quantifies the influence of training setup, synthesis choices, and attacker knowledge on leakage levels, showing that non-zero privacy leakage persists even when adversaries lack perfect knowledge of the training setup, identical data distributions, or large compute resources. The work additionally demonstrates pitfalls in applying heuristic privacy metrics such as distance-to-closest record.

Significance. If the experimental results are robust, the findings are significant for the field of privacy in generative models for tabular data, which is critical in domains like healthcare and finance. The systematic exploration of relaxed attacker assumptions provides concrete evidence that privacy protections cannot rely on limited adversary knowledge, and the critique of heuristic metrics offers practical guidance against over-reliance on them. The empirical design with varying knowledge levels is a clear strength.

minor comments (3)
  1. [§3] §3 (Attack Methodology): the adaptation of the white-box MIA to the diffusion denoising process could include brief pseudocode or an equation showing how the score function or noise prediction is used for membership scoring, to improve reproducibility.
  2. [Table 2, Figure 4] Table 2 and Figure 4: axis labels and legends do not explicitly state the number of runs or random seeds used for averaging attack success rates, which affects interpretation of the reported AUC values.
  3. [§5.2] §5.2 (Heuristic Metrics): the discussion of distance-to-closest record pitfalls would benefit from a direct comparison table showing how this metric correlates (or fails to correlate) with the MIA success rates across the evaluated datasets.
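On minor comment 1, a hedged sketch of the standard denoising-loss membership score for diffusion models may clarify what is being requested. This is illustrative only (the paper's exact formulation may differ); the demo uses an idealized noise-prediction network that has memorized one record, rather than a trained model.

```python
import numpy as np

def denoising_mia_score(x, eps_model, timesteps, alphas_bar, rng):
    """White-box membership score for record x: negated average noise-prediction
    error across diffusion timesteps (members tend to have lower error)."""
    losses = []
    for t in timesteps:
        eps = rng.normal(size=x.shape)                                 # forward-process noise
        x_t = np.sqrt(alphas_bar[t]) * x + np.sqrt(1.0 - alphas_bar[t]) * eps
        losses.append(((eps_model(x_t, t) - eps) ** 2).mean())
    return -float(np.mean(losses))

# Toy demo with an idealized score network that has memorized `member`.
rng = np.random.default_rng(2)
alphas_bar = np.linspace(0.99, 0.1, 10)
member = rng.normal(size=5)

def eps_model(x_t, t):
    # Inverts the forward process under the belief that the clean record is `member`.
    return (x_t - np.sqrt(alphas_bar[t]) * member) / np.sqrt(1.0 - alphas_bar[t])

s_member = denoising_mia_score(member, eps_model, range(10), alphas_bar, rng)
s_nonmember = denoising_mia_score(member + 0.5, eps_model, range(10), alphas_bar, rng)
print(s_member > s_nonmember)  # True: the memorized record gets the higher score
```

Thresholding such scores (or calibrating them against shadow models) turns the per-record loss gap into a membership decision.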

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of our work, recognition of its significance for privacy in generative models for tabular data, and recommendation of minor revision. We appreciate the emphasis on the value of our empirical exploration of relaxed attacker assumptions and the practical guidance regarding heuristic metrics.

Circularity Check

0 steps flagged

No significant circularity; purely empirical evaluation

full rationale

This manuscript is an empirical study that applies existing membership inference attack techniques to tabular diffusion models and measures leakage under varied attacker knowledge, training setups, and synthesis choices. No mathematical derivations, fitted parameters presented as predictions, self-definitional quantities, or ansatzes appear in the work. Claims rest on experimental results rather than any chain that reduces to its own inputs by construction. Self-citations, if present, are not load-bearing for the central empirical findings. The claims are therefore grounded in external benchmarks rather than a self-referential derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical privacy evaluation study; the abstract mentions no free parameters, mathematical axioms, or newly postulated entities.

pith-pipeline@v0.9.0 · 5487 in / 1104 out tokens · 74655 ms · 2026-05-11T00:47:27.992118+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

72 extracted references · 12 canonical work pages · 1 internal anchor

  1. [1]

    Optuna: A Next-generation Hyperparameter Optimization Framework

    Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework, 2019. URL https://arxiv.org/abs/1907.10902

  2. [2]

    How faithful is your synthetic data? Sample-level metrics for evaluating and auditing generative models

    Ahmed M. Alaa, Floris Van Breugel, Evgeny Saveliev, and Mihaela van der Schaar. How faithful is your synthetic data? Sample-level metrics for evaluating and auditing generative models. In Proceedings of the 39th International Conference on Machine Learning (ICML), volume 162 of Proceedings of Machine Learning Research, pages 290–306. PMLR, 2022

  3. [3]

    A linear reconstruction approach for attribute inference attacks against synthetic data

    Meenatchi Sundaram Muthu Selva Annamalai, Andrea Gadotti, and Luc Rocher. A linear reconstruction approach for attribute inference attacks against synthetic data. In Proceedings of the 33rd USENIX Conference on Security Symposium, SEC ’24, USA, 2024. USENIX Association. ISBN 978-1-939133-44-1

  4. [4]

    What do you want from theory alone?

    Meenatchi Sundaram Muthu Selva Annamalai, Georgi Ganev, and Emiliano De Cristofaro. "What do you want from theory alone?" Experimenting with tight auditing of differentially private synthetic data generation. In Proceedings of the 33rd USENIX Conference on Security Symposium, SEC ’24, USA, 2024. USENIX Association

  5. [5]

    Improving question answering model robustness with synthetic adversarial data generation

    Max Bartolo, Tristan Thrush, Robin Jia, Sebastian Riedel, Pontus Stenetorp, and Douwe Kiela. Improving question answering model robustness with synthetic adversarial data generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 8830–8848, Online and Punta Cana, Dominican Republic, November 2021. Associat...

  6. [6]

    Why diffusion models don’t memorize: The role of implicit dynamical regularization in training

    Tony Bonnaire, Raphael Urfin, Giulio Biroli, and Marc Mézard. Why diffusion models don’t memorize: The role of implicit dynamical regularization in training. In Proceedings of the 38th International Conference on Neural Information Processing Systems, Red Hook, NY, USA,

  7. [7]

    Curran Associates Inc

  8. [8]

    Membership inference attacks from first principles

    Nicholas Carlini, Steve Chien, Milad Nasr, Shuang Song, Andreas Terzis, and Florian Tramèr. Membership inference attacks from first principles. In 2022 IEEE Symposium on Security and Privacy (SP), pages 1897–1914, 2022

  9. [9]

    Gan-leaks: A taxonomy of membership inference attacks against generative models

    Dingfan Chen, Ning Yu, Yang Zhang, and Mario Fritz. Gan-leaks: A taxonomy of membership inference attacks against generative models. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, CCS ’20, pages 343–362, New York, NY, USA, 2020. Association for Computing Machinery

  10. [10]

    Membership inference over diffusion-models-based synthetic tabular data, 2025

    Peini Cheng and Amir Bahmani. Membership inference over diffusion-models-based synthetic tabular data, 2025. URL https://arxiv.org/abs/2510.16037

  11. [11]

    Diabetes 130-US Hospitals for Years 1999-2008

    John Clore, Krzysztof Cios, Jon DeShazo, and Beata Strack. Diabetes 130-US Hospitals for Years 1999-2008. UCI Machine Learning Repository, 2014

  12. [12]

    Fake it till you make it: Guidelines for effective synthetic data generation

    Fida K. Dankar and Mahmoud Ibrahim. Fake it till you make it: Guidelines for effective synthetic data generation. Applied Sciences, 11(5), 2021

  13. [13]

    A multi-dimensional evaluation of synthetic data generators

    Fida K. Dankar, Mahmoud K. Ibrahim, and Leila Ismail. A multi-dimensional evaluation of synthetic data generators. IEEE Access, 10:11147–11158, 2022

  14. [14]

    Differentially Private Diffusion Models

    Tim Dockhorn, Tianshi Cao, Arash Vahdat, and Karsten Kreis. Differentially Private Diffusion Models. Transactions on Machine Learning Research, 2023. URL https://openreview.net/forum?id=ZPpQk7FJXF

  15. [15]

    Improving food safety: Synthetic data augmentation for accurate mushroom species identification in complex environments

    Mengze Du, Fei Wang, Weibing Yan, Jie Guo, Lu Liu, Ping Lv, Yong He, Xuping Feng, and Yuwei Wang. Improving food safety: Synthetic data augmentation for accurate mushroom species identification in complex environments. Applied Food Research, 5(1):101039, 2025

  16. [16]

    Are diffusion models vulnerable to membership inference attacks?

    Jinhao Duan, Fei Kong, Shiqi Wang, Xiaoshuang Shi, and Kaidi Xu. Are diffusion models vulnerable to membership inference attacks? In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023

  17. [17]

    Regulation (EU) 2016/679 of the European Parliament and of the Council

    European Parliament and Council of the European Union. Regulation (EU) 2016/679 of the European Parliament and of the Council. OJ L 119, 4.5.2016, p. 1–88, 2016. URL https://data.europa.eu/eli/reg/2016/679/oj

  18. [18]

    On oversampling imbalanced data with deep conditional generative models

    Val Andrei Fajardo, David Findlay, Charu Jaiswal, Xinshang Yin, Roshanak Houmanfar, Honglei Xie, Jiaxi Liang, Xichen She, and D.B. Emerson. On oversampling imbalanced data with deep conditional generative models. Expert Systems with Applications, 169:114463, 2021

  19. [19]

    Large language models (LLMs) on tabular data: Prediction, generation, and understanding - a survey

    Xi Fang, Weijie Xu, Fiona Anting Tan, Ziqing Hu, Jiani Zhang, Yanjun Qi, Srinivasan H. Sengamedu, and Christos Faloutsos. Large language models (LLMs) on tabular data: Prediction, generation, and understanding - a survey. Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URL https://openreview.net/forum?id=IZnrCGF9WI

  20. [20]

    Understanding and mitigating memorization in diffusion models for tabular data

    Zhengyu Fang, Zhimeng Jiang, Huiyuan Chen, Xiao Li, and Jing Li. Understanding and mitigating memorization in diffusion models for tabular data. In Proceedings of the 42nd International Conference on Machine Learning (ICML 2025), 2025

  21. [21]

    MIA-EPT: Membership inference attack via error prediction for tabular data, 2025

    Eyal German, Daniel Samira, Yuval Elovici, and Asaf Shabtai. MIA-EPT: Membership inference attack via error prediction for tabular data, 2025. URL https://arxiv.org/abs/2509.13046

  22. [22]

    A survey on privacy preserving synthetic data generation and a discussion on a privacy-utility trade-off problem

    Debolina Ghatak and Kouichi Sakurai. A survey on privacy preserving synthetic data generation and a discussion on a privacy-utility trade-off problem. In Chunhua Su and Kouichi Sakurai, editors, Science of Cyber Security - SciSec 2022 Workshops, pages 167–180, Singapore, 2022. Springer Nature Singapore. ISBN 978-981-19-7769-5

  23. [23]

    A comparative study of open-source libraries for synthetic tabular data generation: SDV vs. SynthCity

    Cristian Del Gobbo. A comparative study of open-source libraries for synthetic tabular data generation: SDV vs. SynthCity, 2025

  24. [24]

    LOGAN: Membership inference attacks against generative models

    Jamie Hayes, Luca Melis, George Danezis, and Emiliano De Cristofaro. LOGAN: Membership inference attacks against generative models. Proceedings on Privacy Enhancing Technologies, 2019(1):133–152, 2019

  25. [25]

    Comprehensive evaluation framework for synthetic tabular data in health: fidelity, utility and privacy analysis of generative models with and without privacy guarantees

    Mikel Hernandez, Pablo A Osorio-Marulanda, Mikel Catalina, Lorea Loinaz, Gorka Epelde, and Naiara Aginako. Comprehensive evaluation framework for synthetic tabular data in health: fidelity, utility and privacy analysis of generative models with and without privacy guarantees. Front Digit Health, 7:1576290, 2025

  26. [26]

    Membership inference attacks on machine learning: A survey

    Hongsheng Hu, Zoran Salcic, Lichao Sun, Gillian Dobbie, Philip S. Yu, and Xuyun Zhang. Membership inference attacks on machine learning: A survey. ACM Comput. Surv., 54(11s), September 2022

  27. [27]

    Investigating membership inference attacks under data dependencies

    Thomas Humphries, Simon Oya, Lindsey Tulloch, Matthew Rafuse, Ian Goldberg, U. Hengartner, and Florian Kerschbaum. Investigating membership inference attacks under data dependencies. 2023 IEEE 36th Computer Security Foundations Symposium (CSF), pages 473–488, 2020

  28. [28]

    Data augmentation using synthetic data for time series classification with deep residual networks

    Hassan Ismail Fawaz, Germain Forestier, Jonathan Weber, Lhassane Idoumghar, and Pierre-Alain Muller. Data augmentation using synthetic data for time series classification with deep residual networks. In International Workshop on Advanced Analytics and Learning on Temporal Data, ECML PKDD, 2018

  29. [29]

    Evaluating differentially private machine learning in practice

    Bargav Jayaraman and David Evans. Evaluating differentially private machine learning in practice. In Proceedings of the 28th USENIX Conference on Security Symposium, SEC’19, pages 1895–1912, USA, 2019. USENIX Association

  30. [30]

    Revisiting membership inference under realistic assumptions

    Bargav Jayaraman, Lingxiao Wang, David E. Evans, and Quanquan Gu. Revisiting membership inference under realistic assumptions. Proceedings on Privacy Enhancing Technologies, 2021:348–368, 2020

  31. [31]

    PANORAMIA: Privacy auditing of machine learning models without retraining

    Mishaal Kazmi, Hadrien Lautraite, Alireza Akbari, Qiaoyue Tang, Mauricio Soroco, Tao Wang, Sébastien Gambs, and Mathias Lécuyer. PANORAMIA: Privacy auditing of machine learning models without retraining. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=5atraF1tbg

  32. [32]

    TabDDPM: Modelling tabular data with diffusion models

    Akim Kotelnikov, Dmitry Baranchuk, Ivan Rubachev, and Artem Babenko. TabDDPM: Modelling tabular data with diffusion models. In International Conference on Machine Learning, pages 473–488, 2020

  33. [33]

    ensemble-mia

    Hadrien Lautraite, Lorrie Herbault, Yue Qi, Jean-François Rajotte, and Sébastien Gambs. ensemble-mia. https://github.com/CRCHUM-CITADEL/ensemble-mia, 2025

  34. [34]

    SynthEval: A framework for detailed utility and privacy evaluation of tabular synthetic data

    Anton D. Lautrup, Tobias Hyrup, Arthur Zimek, and Peter Schneider-Kamp. SynthEval: A framework for detailed utility and privacy evaluation of tabular synthetic data. https://arxiv.org/abs/2404.15821, 2024

  35. [35]

    Learning differentially private diffusion models via stochastic adversarial distillation

    Bochao Liu, Pengju Wang, and Shiming Ge. Learning differentially private diffusion models via stochastic adversarial distillation. In Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part VII, pages 55–71, Berlin, Heidelberg, 2024. Springer-Verlag

  36. [36]

    Why does differential privacy with large epsilon defend against practical membership inference attacks?

    Andrew Lowy, Zhuohang Li, Jing Liu, Toshiaki Koike-Akino, Kieran Parsons, and Ye Wang. Why does differential privacy with large epsilon defend against practical membership inference attacks? ArXiv, abs/2402.09540, 2024

  37. [37]

    Empirical evaluation on synthetic data generation with generative adversarial network

    Pei-Hsuan Lu, Pang-Chieh Wang, and Chia-Mu Yu. Empirical evaluation on synthetic data generation with generative adversarial network. In Proceedings of the 9th International Conference on Web Intelligence, Mining and Semantics, WIMS2019, New York, NY, USA, 2019. Association for Computing Machinery

  38. [38]

    Efficacy of synthetic data as a benchmark

    Gaurav Maheshwari, Dmitry Ivanov, and Kevin El Haddad. Efficacy of synthetic data as a benchmark, 2024. URL https://arxiv.org/abs/2409.11968

  39. [39]

    Improving performance, robustness, and fairness of radiographic AI models with finely-controllable synthetic data

    Stefania L. Moroianu, Christian Bluethgen, Pierre Chambon, Mehdi Cherti, Jean-Benoit Delbrouck, Magdalini Paschali, Brandon Price, Judy Gichoya, Jenia Jitsev, Curtis P. Langlotz, and Akshay S. Chaudhari. Improving performance, robustness, and fairness of radiographic ai models with finely-controllable synthetic data, 2025. URL https://arxiv.org/abs/2508.16783

  40. [40]

    Canada’s Personal Information Protection and Electronic Documents Act (PIPEDA)

    Government of Canada. Canada’s Personal Information Protection and Electronic Documents Act (PIPEDA), 2000. URL https://laws-lois.justice.gc.ca/eng/acts/P-8.6/. Statutes of Canada 2000, c. 5

  41. [41]

    ClavaDDPM: multi-relational data synthesis with cluster-guided diffusion models

    Wei Pang, Masoumeh Shafieinejad, Lucy Liu, Stephanie Hazlewood, and Xi He. ClavaDDPM: multi-relational data synthesis with cluster-guided diffusion models. In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS ’24, Red Hook, NY, USA, 2025. Curran Associates Inc. ISBN 9798331314385

  42. [42]

    Datasynthesizer: Privacy-preserving synthetic datasets

    Haoyue Ping, Julia Stoyanovich, and Bill Howe. Datasynthesizer: Privacy-preserving synthetic datasets. In Proceedings of the 29th International Conference on Scientific and Statistical Database Management, SSDBM ’17, New York, NY, USA, 2017. Association for Computing Machinery. ISBN 9781450352826

  43. [43]

    Holdout-based empirical assessment of mixed-type synthetic data

    Michael Platzer and Thomas Reutterer. Holdout-based empirical assessment of mixed-type synthetic data. Frontiers in Big Data, 4:679939, 2021

  44. [44]

    Using Synthetic Data for Improving Robustness and Resilience in ML-Based Smart Services

    Ruben Ruiz-Torrubiano, Gerhard Kormann-Hainzl, and Sarita Paudel. Using Synthetic Data for Improving Robustness and Resilience in ML-Based Smart Services, Progress in IS, pages 3–13. Springer, June 2024

  45. [45]

    Ml-leaks: Model and data independent membership inference attacks and defenses on machine learning models

    Ahmad Salem, Yang Zhang, Martin Humbert, Michael Fritz, and Manuel Backes. Ml-leaks: Model and data independent membership inference attacks and defenses on machine learning models. In Network and Distributed Systems Security Symposium (NDSS) 2019, California, USA, February 24–27 2019

  46. [46]

    SoK: Let the Privacy Games Begin! A Unified Treatment of Data Inference Privacy in Machine Learning

    Ahmed Salem, Giovanni Cherubin, David Evans, Boris Kopf, Andrew Paverd, Anshuman Suri, Shruti Tople, and Santiago Zanella-Beguelin. SoK: Let the Privacy Games Begin! A Unified Treatment of Data Inference Privacy in Machine Learning. In 2023 IEEE Symposium on Security and Privacy (SP), pages 327–345, Los Alamitos, CA, USA, May 2023. IEEE Computer Society

  47. [47]

    A decision framework for privacy-preserving synthetic data generation

    Pablo Sanchez-Serrano, Ruben Rios, and Isaac Agudo. A decision framework for privacy-preserving synthetic data generation. Computers and Electrical Engineering, 126:110468, 2025

  48. [48]

    MIDST Challenge at SaTML 2025: Membership Inference over Diffusion-models-based Synthetic Tabular Data

    Masoumeh Shafieinejad, Xi He, Mahshid Alinoori, John Jewell, Sana Ayromlou, Wei Pang, Veronica Chatrath, Gauri Sharma, and Deval Pandya. MIDST Challenge at SaTML 2025: Membership Inference over Diffusion-models-based Synthetic Tabular Data. arXiv:2603.19185, 2026

  49. [49]

    TabDiff: A mixed-type diffusion model for tabular data generation

    Juntong Shi, Minkai Xu, Harper Hua, Hengrui Zhang, Stefano Ermon, and Jure Leskovec. TabDiff: a mixed-type diffusion model for tabular data generation. In The Thirteenth International Conference on Learning Representations, 2025

  50. [50]

    A comprehensive survey of synthetic tabular data generation, 2025

    Ruxue Shi, Yili Wang, Mengnan Du, Xu Shen, Yi Chang, and Xin Wang. A comprehensive survey of synthetic tabular data generation, 2025. URL https://arxiv.org/abs/2504.16506

  51. [51]

    Membership inference attacks against machine learning models

    Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy, SP 2017, San Jose, CA, USA, May 22-26, 2017, pages 3–18. IEEE Computer Society, 2017

  52. [52]

    Synthetic Data–Anonymisation Groundhog Day

    Theresa Stadler, Bristena Oprisanu, and Carmela Troncoso. Synthetic Data–Anonymisation Groundhog Day. In31st USENIX Security Symposium (USENIX Security 22), pages 1451–1468, USA, 2022. USENIX

  53. [53]

    Synthetic data privacy metrics, 2025

    Amy Steier, Lipika Ramaswamy, Andre Manoel, and Alexa Haushalter. Synthetic data privacy metrics, 2025. URL https://arxiv.org/abs/2501.03941

  54. [54]

    A survey on deep learning approaches for tabular data generation: Utility, alignment, fidelity, privacy, diversity, and beyond

    Mihaela C. Stoian, Eleonora Giunchiglia, and Thomas Lukasiewicz. A survey on deep learning approaches for tabular data generation: Utility, alignment, fidelity, privacy, diversity, and beyond. Transactions on Machine Learning Research, 2026. ISSN 2835-8856. URL https://openreview.net/forum?id=RoShSRQQ67

  55. [55]

    Health Insurance Portability and Accountability Act of 1996 (HIPAA)

    U.S. Congress. Health Insurance Portability and Accountability Act of 1996 (HIPAA). Public Law 104–191, 110 Stat. 1936, 1996. URL https://www.govinfo.gov/content/pkg/PLAW-104publ191/pdf/PLAW-104publ191.pdf. Enacted August 21, 1996

  56. [56]

    Membership inference attacks against synthetic data through overfitting detection

    Boris van Breugel, Hao Sun, Zhaozhi Qian, and Mihaela van der Schaar. Membership inference attacks against synthetic data through overfitting detection. In Francisco J. R. Ruiz, Jennifer G. Dy, and Jan-Willem van de Meent, editors, International Conference on Artificial Intelligence and Statistics, 25-27 April 2023, Palau de Congressos, Valencia, Spain, ...

  57. [57]

    The berka dataset

    Marcelo Ventura. The berka dataset. https://www.kaggle.com/datasets/marceloventura/the-berka-dataset, 2020. Accessed: 2025-12-10

  58. [58]

    Ensembling membership inference attacks against tabular generative models

    Joshua Ward, Yuxuan Yang, Chi-Hua Wang, and Guang Cheng. Ensembling membership inference attacks against tabular generative models. In Proceedings of the 18th ACM Workshop on Artificial Intelligence and Security, AISec ’25, pages 182–193, New York, NY, USA, 2026. Association for Computing Machinery

  59. [59]

    The application of membership inference in privacy auditing of large language models based on fine-tuning method

    Chengjiang Wen, Yang Yue, and Zhixiang Wang. The application of membership inference in privacy auditing of large language models based on fine-tuning method. In Proceedings of the 2025 2nd International Conference on Generative Artificial Intelligence and Information Security, GAIIS ’25, pages 473–479, New York, NY, USA, 2025. Association for Computing Ma...

  60. [60]

    Winning the MIDST challenge: New membership inference attacks on diffusion models for tabular data synthesis

    Xiaoyu Wu, Yifei Pang, Terrance Liu, and Steven Wu. Winning the MIDST challenge: New membership inference attacks on diffusion models for tabular data synthesis. arXiv preprint,

  61. [61]

    URL https://arxiv.org/abs/2503.12008

  62. [62]

    Generative data augmentation for commonsense reasoning

    Yiben Yang, Chaitanya Malaviya, Jared Fernandez, Swabha Swayamdipta, Ronan Le Bras, Ji-Ping Wang, Chandra Bhagavatula, Yejin Choi, and Doug Downey. Generative data augmentation for commonsense reasoning. In Trevor Cohn, Yulan He, and Yang Liu, editors, Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1008–1025, Online, November

  63. [63]

    Association for Computational Linguistics

  64. [64]

    The DCR delusion: Measuring the privacy risk of synthetic data

    Zexi Yao, Nataša Krčo, Georgi Ganev, and Yves-Alexandre de Montjoye. The DCR delusion: Measuring the privacy risk of synthetic data, 2025. URL https://arxiv.org/abs/2505.01524

  65. [65]

    Synaug: Exploiting synthetic data for data imbalance problems

    Moon Ye-Bin, Nam Hyeon-Woo, Wonseok Choi, Nayeong Kim, Suha Kwak, and Tae-Hyun Oh. Synaug: Exploiting synthetic data for data imbalance problems. Pattern Recognition Letters, 193:115–121, 2025

  66. [66]

    Privacy risk in machine learning: Analyzing the connection to overfitting

    Samuel Yeom, Irene Giacomelli, Matt Fredrikson, and Somesh Jha. Privacy risk in machine learning: Analyzing the connection to overfitting. 2018 IEEE 31st Computer Security Foundations Symposium (CSF), pages 268–282, 2017

  67. [67]

    Low-cost high-power membership inference attacks

    Sajjad Zarifzadeh, Philippe Liu, and Reza Shokri. Low-cost high-power membership inference attacks. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Resear...

  68. [68]

    Mixed-type tabular data synthesis with score-based diffusion in latent space

    Hengrui Zhang, Jiani Zhang, Balasubramaniam Srinivasan, Zhengyuan Shen, Xiao Qin, Christos Faloutsos, Huzefa Rangwala, and George Karypis. Mixed-type tabular data synthesis with score-based diffusion in latent space. In The Twelfth International Conference on Learning Representations, 2024

  69. [69]

    DP-TLDM: Differentially private tabular latent diffusion model

    Chaoyi Zhu, Jiayi Tang, Juan F. Pérez, Marten van Dijk, and Lydia Y. Chen. DP-TLDM: Differentially private tabular latent diffusion model, 2025. URL https://arxiv.org/abs/2403.07842

  70. [70]

    Yujin Zhu, Zilong Zhao, Robert Birke, and Lydia Y. Chen. Permutation-invariant tabular data synthesis. In 2022 IEEE International Conference on Big Data (Big Data), pages 5855–5864,

  71. [71]

    doi: 10.1109/BigData55660.2022.10020639

  72. [72]

    That which we call private

    Úlfar Erlingsson, Ilya Mironov, Ananth Raghunathan, and Shuang Song. That which we call private, 2020. URL https://arxiv.org/abs/1908.03566