pith. sign in

arxiv: 2606.08372 · v1 · pith:Y7QUQ7EXnew · submitted 2026-06-06 · 💻 cs.CR · cs.LG

SoK: Reconstruction Attacks on Synthetic Tabular Data (Insights from Winning the NIST CRC)

Pith reviewed 2026-06-27 19:11 UTC · model grok-4.3

classification 💻 cs.CR cs.LG
keywords reconstruction attackssynthetic tabular dataattribute inferencedifferential privacysynthetic data generationmemorization testNIST CRCprivacy evaluation
0
0 comments X

The pith

The choice of synthetic data generator determines reconstruction risk far more than the choice of attack method.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper organizes reconstruction attacks on synthetic tabular data into a taxonomy based on the structural information each attack exploits. It runs the largest head-to-head evaluation to date, pitting fourteen attacks against nine generators across five standard datasets, and introduces a memorization test plus a common scale that lets reconstruction success be compared directly with membership inference. The central result is that generator choice explains far more of the observed risk than attack choice; differential privacy adds protection mainly when the privacy budget is small, after which risk plateaus at a level set by the generator itself. De-identified releases are the most exposed, and nearly all successful reconstructions recover population-level patterns rather than individual training records, with risk concentrated on atypical entries.

Core claim

Reconstruction attacks recover hidden attribute values for individuals given a synthetic table and a few known quasi-identifiers. The work shows that differences among the nine synthetic data generators account for substantially more variance in attack success than differences among the fourteen attacks; differential privacy reduces risk only for budgets ε ≲ 1 and then plateaus; de-identification methods without synthesis are the most vulnerable; and a new test reveals that most reconstruction success reflects learning the underlying distribution rather than memorizing specific training rows.

What carries the argument

A taxonomy of reconstruction attacks classified by the structural cues they use, paired with a memorization test that separates distributional reconstruction from training-record memorization and a reduction that places reconstruction and membership inference on one numeric scale.

If this is right

  • Efforts to reduce reconstruction risk should prioritize selection or improvement of the synthetic data generator rather than selection of a particular attack defense.
  • Differential privacy supplies meaningful protection only when the privacy budget remains small; larger budgets add little beyond the generator's native capacity.
  • Plain de-identification without any synthesis step leaves releases most exposed to attribute recovery.
  • Risk is concentrated on atypical records, so defenses that specifically limit leakage on outliers would address most of the measured harm.
  • Most measured reconstruction reflects learning of the data distribution, so techniques that improve distributional fidelity without increasing memorization can lower risk.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Deployments may need to benchmark several candidate generators on their own data before release rather than relying on published average-case rankings.
  • Generator design could target reduced leakage specifically on records that deviate from the learned distribution.
  • The same generator-dominance pattern may appear when the taxonomy is extended to other data types such as graphs or sequences.
  • Organizations could adopt the memorization test as a routine pre-release check to decide whether a given synthetic release is safe enough for their risk tolerance.

Load-bearing premise

The five benchmark datasets and nine SDG methods are representative enough that the observed dominance of generator choice over attack choice will generalize to other real-world tabular releases and generators.

What would settle it

A new tabular dataset and tenth generator in which swapping the attack changes success rates by a larger margin than swapping the generator would falsify the dominance claim.

Figures

Figures reproduced from arXiv: 2606.08372 by Martine De Cock, Sikha Pentyala, Steven Golob.

Figure 1
Figure 1. Figure 1: The reconstruction threat model (synthetic-data [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Taxonomy of reconstruction attacks. We study the two lower branches (reconstruction from de-identified microdata [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Pairwise ensemble reconstruction accuracy ( [PITH_FULL_IMAGE:figures/full_fig_p018_3.png] view at source ↗
read the original abstract

Synthetic data is increasingly promoted as a privacy-preserving substitute for releasing sensitive tabular records, yet its central adversarial threat ("reconstruction", the recovery of an individual's hidden attribute values from a synthetic release and a handful of known quasi-identifiers) has been studied only in scattered, hard-to-compare settings. We present the first systematization of reconstruction (equivalently, attribute inference) attacks on de-identified and synthetic tabular data. We contribute a taxonomy that organizes attacks by the structure they exploit; the most systematic empirical evaluation to date, pitting fourteen attacks against nine synthetic data generation (SDG) methods across five benchmark datasets; and a set of new attacks that fill gaps in the taxonomy, one of which (CoBP-RA) is the strongest attack we measure. Crucially, we introduce a methodology for interpreting what attack success means: a memorization test that distinguishes reconstruction of the population distribution from memorization of training records, and a reduction that places reconstruction and membership inference on a single comparable scale. Our findings: the choice of SDG method governs risk far more than the choice of attack; differential privacy protects mainly at small budgets ($\varepsilon\lesssim1$), above which protection plateaus, bounded by the synthesizer's capacity rather than its noise; de-identification methods are the most exposed; and most reconstruction reflects distributional structure rather than memorization, concentrating individual risk on atypical records. The attacks and infrastructure are externally validated by our first-place finish among all red teams in the 2025 \textit{National Institute of Standards and Technology} (NIST) Collaborative Research Cycle.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript is a Systematization of Knowledge on reconstruction (attribute inference) attacks against synthetic tabular data. It contributes a taxonomy of attacks by exploited structure, the largest empirical study to date (14 attacks vs. 9 SDG methods on 5 benchmark datasets), new attacks including CoBP-RA as the strongest measured, and two interpretive methodologies (a memorization test distinguishing population distribution learning from training-record memorization, plus a reduction placing reconstruction and membership inference on a common scale). The central empirical claims are that SDG method choice dominates risk over attack choice, differential privacy is protective mainly for ε≲1 and plateaus thereafter (bounded by synthesizer capacity), de-identification methods are most exposed, and most observed reconstruction reflects distributional structure rather than memorization (concentrating risk on atypical records). All findings are externally validated by the authors' first-place finish among red teams in the 2025 NIST Collaborative Research Cycle.

Significance. If the comparative rankings and DP-plateau observations hold under the reported conditions, the work supplies actionable guidance for synthetic data release: practitioners should prioritize generator selection and recognize the limited marginal value of DP beyond small budgets. The scale of the evaluation, the NIST competition validation, and the explicit memorization test are concrete strengths that elevate the contribution beyond prior scattered studies.

major comments (2)
  1. [Abstract / empirical evaluation] Abstract and empirical evaluation section: the central claim that 'the choice of SDG method governs risk far more than the choice of attack' rests on variance comparisons across the nine generators and five datasets. The manuscript does not report statistical significance tests, error bars, or explicit data-exclusion rules, leaving open the possibility that the observed dominance is sensitive to post-hoc selection or to shared properties of the chosen benchmarks (low dimensionality, particular correlation structures).
  2. [Findings / DP analysis] Findings paragraph: the claim that DP protection 'plateaus, bounded by the synthesizer's capacity rather than its noise' for ε≳1 is load-bearing for the practical takeaway. Without a sensitivity analysis that varies synthesizer capacity independently of the privacy mechanism, or explicit controls for the interaction between ε and generator architecture, it is unclear whether the plateau is an artifact of the nine chosen methods.
minor comments (2)
  1. [Taxonomy section] The taxonomy and attack descriptions would benefit from a single summary table listing each attack, the structure it exploits, and whether it is new or adapted from prior work.
  2. [Figures] Figure captions for the attack-success heatmaps should explicitly state the metric (e.g., precision@K, AUC) and whether results are averaged over multiple random seeds or data splits.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below with clarifications and commitments to revision.

read point-by-point responses
  1. Referee: [Abstract / empirical evaluation] Abstract and empirical evaluation section: the central claim that 'the choice of SDG method governs risk far more than the choice of attack' rests on variance comparisons across the nine generators and five datasets. The manuscript does not report statistical significance tests, error bars, or explicit data-exclusion rules, leaving open the possibility that the observed dominance is sensitive to post-hoc selection or to shared properties of the chosen benchmarks (low dimensionality, particular correlation structures).

    Authors: We agree that the absence of formal statistical tests and error bars leaves the variance comparison open to the concerns raised. The dominance pattern is consistent across all five datasets and fourteen attacks and receives external corroboration from our first-place NIST CRC result, but this does not substitute for quantitative support. In revision we will add error bars (multiple seeds), report variance decomposition (e.g., ANOVA or eta-squared) between generator and attack factors, and state explicit data-exclusion rules. These additions will be included without changing the reported ordering or conclusions. revision: yes

  2. Referee: [Findings / DP analysis] Findings paragraph: the claim that DP protection 'plateaus, bounded by the synthesizer's capacity rather than its noise' for ε≳1 is load-bearing for the practical takeaway. Without a sensitivity analysis that varies synthesizer capacity independently of the privacy mechanism, or explicit controls for the interaction between ε and generator architecture, it is unclear whether the plateau is an artifact of the nine chosen methods.

    Authors: The plateau is observed uniformly across the DP variants among the nine generators, which already span different architectures and capacities. We nevertheless accept that an explicit sensitivity analysis isolating capacity would strengthen the claim. In the revision we will add a dedicated paragraph discussing capacity–ε interactions, reference any available external evidence on capacity limits, and note the limitation that a fully orthogonal capacity sweep was outside the scope of the current study. The core observation will be qualified accordingly. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical results rest on external benchmarks and cross-method comparisons

full rationale

The paper's load-bearing claims (SDG method dominates attack choice; DP plateaus above ε≲1 bounded by synthesizer capacity) are direct outputs of pitting 14 attacks against 9 generators on 5 datasets, with external validation via first-place NIST CRC finish. No equations or results reduce by construction to fitted parameters, self-definitions, or self-citation chains; the taxonomy, new attacks, and memorization test are independent contributions whose success is measured against held-out competition outcomes rather than internal fits. Representativeness concerns affect generalization but do not create circularity in the reported derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical systematization of knowledge with no new mathematical derivations, free parameters, or postulated entities; all claims rest on experimental comparisons and external competition results.

pith-pipeline@v0.9.1-grok · 5822 in / 1265 out tokens · 24107 ms · 2026-06-27T19:11:23.769926+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

80 extracted references · 18 canonical work pages

  1. [1]

    John M Abowd. 2018. The U.S. Census Bureau Adopts Differential Privacy. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2867

  2. [2]

    John M Abowd, Robert Ashmead, Ryan Cumings-Menon, Simson Garfinkel, Micah Heineck, Christine Heiss, Robert Johns, Daniel Kifer, Philip Leclerc, Ash- win Machanavajjhala, Brett Moran, William Sexton, Matthew Spence, and Pavel Zhuravlev. 2022. The 2020 Census Disclosure Avoidance System TopDown Al- gorithm.Harvard Data Science Review(2022). doi:10.1162/9960...

  3. [3]

    Tristan Allard, Louis Béziaud, and Sébastien Gambs. 2023. SNAKE challenge: Sanitization algorithms under attack. InProceedings of the 32nd ACM international conference on information and knowledge management. 5010–5014

  4. [4]

    Meenatchi Sundaram Muthu Selva Annamalai, Andrea Gadotti, and Luc Rocher

  5. [5]

    In33rd USENIX security symposium (USENIX security 24)

    A linear reconstruction approach for attribute inference attacks against synthetic data. In33rd USENIX security symposium (USENIX security 24). 2351– 2368

  6. [6]

    Assefa, Danial Dervovic, Mahmoud Mahfouz, Robert E

    Samuel A. Assefa, Danial Dervovic, Mahmoud Mahfouz, Robert E. Tillman, Prashant Reddy, and Manuela Veloso. 2020. Generating Synthetic Data in Finance: 13 Under Review Golob et al. Opportunities, Challenges and Pitfalls. InProceedings of the First ACM Interna- tional Conference on AI in Finance (ICAIF). ACM. doi:10.1145/3383455.3422554

  7. [7]

    Eugene Bagdasaryan, Omid Poursaeed, and Vitaly Shmatikov. 2019. Differential privacy has disparate impact on model accuracy.Advances in neural information processing systems32 (2019)

  8. [8]

    Barry Becker and Ronny Kohavi. 1996. Adult Data Set. https://archive.ics.uci. edu/ml/datasets/adult. doi:10.24432/C5XW20 UCI Machine Learning Repository

  9. [9]

    Stanley, and Evan Totty

    Gary Benedetto, Jordan C. Stanley, and Evan Totty. 2018.The Creation and Use of the SIPP Synthetic Beta v7.0. Working Paper. U.S. Census Bureau

  10. [10]

    Leo Breiman. 2001. Random Forests.Machine Learning45, 1 (2001), 5–32

  11. [11]

    Nicholas Carlini, Steve Chien, Milad Nasr, Shuang Song, Andreas Terzis, and Florian Tramer. 2022. Membership inference attacks from first principles. In2022 IEEE symposium on security and privacy (SP). IEEE, 1897–1914

  12. [12]

    Nicholas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramer, Borja Balle, Daphne Ippolito, and Eric Wallace. 2023. Extract- ing Training Data from Diffusion Models. In32nd USENIX Security Symposium (USENIX Security 23). 5253–5270

  13. [13]

    Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert- Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, and Colin Rafique. 2021. Extracting Training Data from Large Language Models. In30th USENIX Security Symposium (USENIX Security 21). 2633–2650

  14. [14]

    Centers for Disease Control and Prevention. 2015. CDC Diabetes Health In- dicators Dataset. https://archive.ics.uci.edu/dataset/891/cdc+diabetes+health+ indicators. Behavioral Risk Factor Surveillance System (BRFSS)

  15. [15]

    Chen Chen, Lingjuan Lyu, Han Yu, and Gang Chen. 2024. Practical Attribute Reconstruction Attack Against Federated Learning.IEEE Transactions on Big Data10, 6 (2024), 851–863. doi:10.1109/TBDATA.2022.3159236

  16. [16]

    Edith Cohen, Haim Kaplan, Yishay Mansour, Shay Moran, Kobbi Nissim, Uri Stemmer, and Eliad Tsfadia. 2024. Data reconstruction: When you see it and when you don’t.arXiv preprint arXiv:2405.15753(2024)

  17. [17]

    Hakan Demirtas. 2018. Flexible imputation of missing data.Journal of statistical software85 (2018), 1–5

  18. [18]

    Travis Dick, Cynthia Dwork, Michael Kearns, Terrance Liu, Aaron Roth, Giuseppe Vietri, and Zhiwei Steven Wu. 2023. Confidence-ranked reconstruction of census microdata from published statistics.Proceedings of the National Academy of Sciences120, 8 (2023), e2218605120

  19. [19]

    Irit Dinur and Kobbi Nissim. 2003. Revealing information while preserving privacy. InProceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART sym- posium on Principles of database systems. 202–210

  20. [20]

    Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. Cali- brating Noise to Sensitivity in Private Data Analysis. InProceedings of the Third Conference on Theory of Cryptography (TCC). Springer, 265–284

  21. [21]

    Cynthia Dwork and Aaron Roth. 2014. The algorithmic foundations of differential privacy.Foundations and Trends®in Theoretical Computer Science9, 3–4 (2014), 211–407

  22. [22]

    Cynthia Dwork, Adam Smith, Thomas Steinke, and Jonathan Ullman. 2017. Ex- posed! a survey of attacks on private data.Annual Review of Statistics and Its Application4, 1 (2017), 61–84

  23. [23]

    Khaled El Emam, Elizabeth Jonker, Luk Arbuckle, and Bradley Malin. 2011. A systematic review of re-identification attacks on health data.PloS One6, 12 (2011), e28071

  24. [24]

    Khaled El Emam, Lucy Mosquera, and Jason Bass. 2020. Evaluating Identity Dis- closure Risk in Fully Synthetic Health Data: Model Development and Validation. Journal of Medical Internet Research22, 11 (2020), e23139. doi:10.2196/23139

  25. [25]

    Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. 2015. Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. 1322–1333

  26. [26]

    On the (In)Security of LLM app stores,

    Georgi Ganev and Emiliano De Cristofaro. 2025. The Inadequacy of Similarity- Based Privacy Metrics: Privacy Attacks Against “Truly Anonymous” Synthetic Datasets. In2025 IEEE Symposium on Security and Privacy (SP). 4007–4025. doi:10. 1109/SP61157.2025.00218 ISSN: 2375-1207

  27. [27]

    Simson L Garfinkel, John M Abowd, and Christian Martindale. 2019. Understand- ing Database Reconstruction Attacks on Public Data.Commun. ACM62, 3 (2019), 46–53

  28. [28]

    Matteo Giomi, Franziska Boenisch, Christoph Wehmeyer, and Borbála Tasnádi

  29. [29]

    Proceedings on Privacy Enhancing Technologies(2023)

    A Unified Framework for Quantifying Privacy Risk in Synthetic Data. Proceedings on Privacy Enhancing Technologies(2023)

  30. [30]

    Steven Golob, Sikha Pentyala, Anuar Maratkhan, and Martine De Cock. 2025. Privacy Vulnerabilities in Marginals-based Synthetic Data. In2025 IEEE Confer- ence on Secure and Trustworthy Machine Learning (SaTML). 977–995. doi:10.1109/ SaTML64287.2025.00059

  31. [31]

    Jamie Hayes, Luca Melis, George Danezis, and Emiliano De Cristofaro. 2019. LOGAN: Membership inference attacks against generative models.Proceedings on Privacy Enhancing Technologies1 (2019), 133–152

  32. [32]

    Benjamin Hilprecht, Martin Härterich, and Daniel Bernau. 2019. Monte carlo and reconstruction membership inference attacks against generative models. Proceedings on Privacy Enhancing Technologies2019, 4 (2019), 232–249

  33. [33]

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising Diffusion Probabilistic Models. InAdvances in Neural Information Processing Systems, Vol. 33. 6840–6851

  34. [34]

    Noah Hollmann, Samuel Müller, Katharina Eggensperger, and Frank Hutter. 2023. TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second. InInternational Conference on Learning Representations (ICLR)

  35. [35]

    Florimond Houssiau, James Jordon, Samuel N Cohen, Owen Daniel, Andrew Elliott, James Geddes, Callum Mole, Camila Rangel-Smith, and Lukasz Szpruch

  36. [36]

    TAPAS: a toolbox for adversarial privacy auditing of synthetic data.arXiv preprint arXiv:2211.06550(2022)

  37. [37]

    Yuzheng Hu, Fan Wu, Qinbin Li, Yunhui Long, Gonzalo Garrido, Chang Ge, Bolin Ding, David Forsyth, Bo Li, and Dawn Song. 2024. SoK: privacy-preserving data synthesis. In2024 IEEE symposium on security and privacy (SP). 4696–4713

  38. [38]

    Bargav Jayaraman and David Evans. 2022. Are attribute inference attacks just imputation?. InProceedings of the 2022 ACM SIGSAC conference on computer and communications security. 1569–1582

  39. [39]

    James Jordon, Lukasz Szpruch, Florimond Houssiau, Mirko Bottarelli, Giovanni Cherubin, Carsten Maple, Samuel N Cohen, and Adrian Weller. 2022. Synthetic Data–what, why and how?arXiv preprint arXiv:2205.03257(2022)

  40. [40]

    Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. Lightgbm: A highly efficient gradient boosting decision tree.Advances in neural information processing systems30 (2017)

  41. [41]

    Akim Kotelnikov, Dmitry Baranchuk, Ivan Rubachev, and Artem Babenko. 2023. Tabddpm: Modelling tabular data with diffusion models. InInternational confer- ence on machine learning. PMLR, 17564–17579

  42. [42]

    Zinan Lin, Ashish Khetan, Giulia Fanti, and Sewoong Oh. 2018. PacGAN: The power of two samples in generative adversarial networks. InAdvances in Neural Information Processing Systems, Vol. 31

  43. [43]

    Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. 2022. Repaint: Inpainting using denoising diffusion proba- bilistic models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 11461–11471

  44. [44]

    Lingjuan Lyu and Chen Chen. 2021. A Novel Attribute Reconstruction Attack in Federated Learning.arXiv preprint arXiv:2108.06910(2021)

  45. [45]

    Ashwin Machanavajjhala, Daniel Kifer, Johannes Gehrke, and Muthuramakr- ishnan Venkitasubramaniam. 2007. ℓ-Diversity: Privacy Beyond 𝑘-Anonymity. ACM Transactions on Knowledge Discovery from Data1, 1 (2007), 3

  46. [46]

    Ryan McKenna, Gerome Miklau, and Daniel Sheldon. 2021. Winning the NIST Contest: A scalable and general approach to differentially private synthetic data. arXiv preprint arXiv:2108.04978(2021)

  47. [47]

    Ryan McKenna, Brett Mullins, Daniel Sheldon, and Gerome Miklau. 2022. Aim: An adaptive and iterative mechanism for differentially private synthetic data. arXiv preprint arXiv:2201.12677(2022)

  48. [48]

    Ryan McKenna, Daniel Sheldon, and Gerome Miklau. 2019. Graphical-model based estimation and inference for differential privacy. InInternational conference on machine learning. PMLR, 4435–4444

  49. [49]

    Matthieu Meeus, Florent Guepin, Ana-Maria Creţu, and Yves-Alexandre de Mon- tjoye. 2023. Achilles’ heels: vulnerable record identification in synthetic data publishing. InEuropean symposium on research in computer security. Springer, 380–399

  50. [50]

    Richard A. Moore. 1996.Controlled data-swapping techniques for masking pub- lic use microdata sets. Technical Report RR 96/04. U.S. Bureau of the Census, Statistical Research Division

  51. [51]

    Krishnamurty Muralidhar and Josep Domingo-Ferrer. 2023. Database recon- struction is not so easy and is different from reidentification.Journal of Official Statistics39, 3 (2023), 381–398. Publisher: SAGE Publications Sage UK: London, England

  52. [52]

    Arvind Narayanan and Vitaly Shmatikov. 2008. Robust De-Anonymization of Large Sparse Datasets. In2008 IEEE Symposium on Security and Privacy (SP). IEEE, 111–125

  53. [53]

    National Institute of Standards and Technology. 2018. NIST Differential Privacy Synthetic Data Challenge (Arizona Dataset). https://www.nist.gov/ctl/pscr/open- innovation-prize-challenges. Synthetic census-like dataset

  54. [54]

    National Institute of Standards and Technology. 2018. NIST Synthetic Data Challenge: Survey of Business Owners (SBO). https://www.nist.gov/ctl/pscr/ open-innovation-prize-challenges. Synthetic business transaction dataset

  55. [55]

    Beata Nowok, Gillian M Raab, and Chris Dibben. 2016. synthpop: Bespoke creation of synthetic data in R.Journal of statistical software74 (2016), 1–26

  56. [56]

    Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. 2011. Classifier chains for multi-label classification.Machine Learning85, 3 (2011), 333–359. doi:10.1007/s10994-011-5256-5

  57. [57]

    Luc Rocher, Julien M Hendrickx, and Yves-Alexandre De Montjoye. 2019. Esti- mating the success of re-identifications in incomplete datasets using generative models.Nature Communications10, 1 (2019), 3069

  58. [58]

    Pierangela Samarati. 2001. Protecting Respondents’ Identities in Microdata Release.IEEE Transactions on Knowledge and Data Engineering13, 6 (2001), 1010–1027. doi:10.1109/69.971193 14 Reconstruction Attaxonomy Under Review )

  59. [59]

    scikit-learn developers. 2024. California Housing Dataset. https://scikit-learn.org/ stable/modules/generated/sklearn.datasets.fetch_california_housing.html. De- rived from 1990 US Census data

  60. [60]

    Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. 2017. Mem- bership inference attacks against machine learning models. In2017 IEEE Sympo- sium on Security and Privacy (SP). IEEE, 3–18

  61. [61]

    Theresa Stadler, Bristena Oprisanu, and Carmela Troncoso. 2022. Synthetic data–anonymisation groundhog day. In31st USENIX security symposium (USENIX security 22). 1451–1468

  62. [62]

    Daniel J Stekhoven and Peter Bühlmann. 2012. MissForest—non-parametric missing value imputation for mixed-type data.Bioinformatics28, 1 (2012), 112– 118

  63. [63]

    Jonathan AC Sterne, Ian R White, John B Carlin, Michael Spratt, Patrick Royston, Michael G Kenward, Angela M Wood, and James R Carpenter. 2009. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls.Bmj338 (2009)

  64. [64]

    2000.Simple Demographics Often Identify People Uniquely

    Latanya Sweeney. 2000.Simple Demographics Often Identify People Uniquely. Technical Report LIDAP-WP4. Carnegie Mellon University, Data Privacy Labora- tory

  65. [65]

    Latanya Sweeney. 2002. k-anonymity: A model for protecting privacy.Interna- tional Journal of Uncertainty, Fuzziness and Knowledge-Based Systems(2002)

  66. [66]

    Matthias Templ, Alexander Kowarik, and Bernhard Meindl. 2015. Statistical Disclosure Control for Micro-Data Using the R Package sdcMicro.Journal of Statistical Software67, 4 (2015), 1–36

  67. [67]

    Department of Health and Human Services, Office for Civil Rights

    U.S. Department of Health and Human Services, Office for Civil Rights. 2012. Guidance Regarding Methods for De-identification of Protected Health Informa- tion in Accordance with the HIPAA Privacy Rule. https://www.hhs.gov/hipaa/for- professionals/special-topics/de-identification/index.html

  68. [68]

    Boris Van Breugel, Hao Sun, Zhaozhi Qian, and Mihaela van der Schaar. 2023. Membership inference attacks against synthetic data through overfitting detec- tion.arXiv preprint arXiv:2302.12580(2023)

  69. [69]

    Stef Van Buuren and Karin Groothuis-Oudshoorn. 2011. mice: Multivariate imputation by chained equations in R.Journal of statistical software45 (2011), 1–67

  70. [70]

    Jason Walonoski, Mark Kramer, Joseph Nichols, Andre Quina, Chris Moesel, Dylan Hall, Carlton Duffett, Kudakwashe Dube, Thomas Gallagher, and Scott McLachlan. 2018. Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record. Journal of the American Medical Informatics Association25...

  71. [71]

    David S Watson, Kristin Blesch, Jan Kapar, and Marvin N Wright. 2023. Ad- versarial random forests for density estimation and generative modeling. In International conference on artificial intelligence and statistics. PMLR, 5357–5375

  72. [72]

    2001.Elements of Statistical Disclosure Control

    Leon Willenborg and Ton de Waal. 2001.Elements of Statistical Disclosure Control. Springer, New York

  73. [73]

    Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, and Kalyan Veeramachaneni

  74. [74]

    Modeling tabular data using conditional gan.Advances in neural information processing systems32 (2019)

  75. [75]

    M Yaghini, B Kulynych, G Cherubin, M Veale, and C Troncoso. 2022. Disparate vulnerability to membership inference attacks.Proceedings on Privacy Enhancing Technologies2022, 1 (2022), 460–480

  76. [76]

    Samuel Yeom, Irene Giacomelli, Matt Fredrikson, and Somesh Jha. 2018. Privacy Risk in Machine Learning: Analyzing the Connection to Overfitting. In2018 IEEE 31st Computer Security Foundations Symposium (CSF). IEEE, 268–282

  77. [77]

    Jinsung Yoon, James Jordon, and Mihaela Schaar. 2018. Gain: Missing data impu- tation using generative adversarial nets. InInternational conference on machine learning. PMLR, 5689–5698

  78. [78]

    Jun Zhang, Graham Cormode, Cecilia M Procopiuc, Divesh Srivastava, and Xi- aokui Xiao. 2017. PrivBayes: Private data release via Bayesian networks.ACM Transactions on Database Systems (TODS)42, 4 (2017), 1–41

  79. [79]

    adversar- ial

    Benjamin Zi Hao Zhao, Aviral Agrawal, Catisha Coburn, Hassan Jameel Asghar, Raghav Bhaskar, Mohamed Ali Kaafar, Darren Webb, and Peter Dickinson. 2021. On the (in) feasibility of attribute inference attacks on machine learning models. In2021 IEEE european symposium on security and privacy (EuroS&P). IEEE, 232– 251. A Preliminaries A.1 Dataset Details The ...

  80. [80]

    + demo”) also appear in QIbeh.; similarly, the six “ + large

    CoBP-RA wraps the same25-tree forest with belief propagation over an MST of pairwise synthetic marginals (local PMI from the 100nearest synthetic neighbours, Laplace smoothing 𝛼=10−6); its continuous variant quantile-bins each hidden feature into20bins and switches to global PMI. The diffusion attacks (CondDDPM, Con- dRePaint, CondDDPMWithMLP) share one M...