
arxiv: 2512.08875 · v2 · submitted 2025-12-09 · 💻 cs.LG · cs.AI

When Tables Leak: Attacking String Memorization in LLM-Based Tabular Data Generation

Pith reviewed 2026-05-16 23:44 UTC · model grok-4.3

keywords membership inference attack · tabular data generation · LLM memorization · synthetic data privacy · digit string leakage · no-box attack · privacy leakage · data utility

The pith

LLM tabular data generators leak training records through memorized numeric digit strings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models used for synthetic tabular data generation, whether fine-tuned or prompted in context, frequently reproduce specific numeric digit sequences from their training examples. This reproduction enables a no-box membership inference attack called LevAtt that examines only the generated synthetic tables to determine whether particular records appeared in training. The attack demonstrates substantial privacy leakage across multiple models and datasets, reaching perfect classification accuracy in some cases on current state-of-the-art generators. The paper also introduces mitigation approaches, including a sampling method that perturbs digits during generation, which reduces the leakage while preserving most of the synthetic data's fidelity and utility.

Core claim

Popular LLM adaptations for tabular data generation memorize and reproduce string sequences of numeric digits drawn from training observations. This memorization allows a simple attack with access solely to the synthetic outputs to infer training-set membership by matching those digit strings, exposing privacy leakage that can reach perfect accuracy on certain models and datasets.

What carries the argument

LevAtt, a no-box membership inference attack that targets memorized string sequences of numeric digits in synthetic observations to classify training-set membership.
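
As a rough illustration of the mechanism, the sketch below scores a candidate record by its smallest normalized Levenshtein distance to any synthetic row, after both are serialized to strings. The comma-joined serialization, the `encode_row` and `membership_score` names, and the length normalization are assumptions made for illustration; the paper's exact encoding and decision rule are not reproduced here.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def encode_row(row):
    """Hypothetical serialization: join field values with commas."""
    return ",".join(str(value) for value in row)


def membership_score(candidate, synthetic_rows):
    """Higher score = closer to some synthetic row = more member-like."""
    target = encode_row(candidate)
    distances = []
    for row in synthetic_rows:
        text = encode_row(row)
        longest = max(len(target), len(text), 1)
        distances.append(levenshtein(target, text) / longest)
    return 1.0 - min(distances)
```

A candidate that appears verbatim in the synthetic table scores 1.0; thresholding the score (or ranking candidates by it) then yields a membership decision that needs nothing beyond the released synthetic data.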

If this is right

  • Both fine-tuning and in-context prompting regimes for LLM tabular generation exhibit the leakage.
  • The attack requires no model weights or training data access, only the synthetic outputs.
  • A digit-perturbation sampling strategy during generation defeats the attack while keeping fidelity and utility losses small.
  • The vulnerability applies across a wide range of models and tabular datasets.
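
To make the digit-perturbation bullet concrete, here is a generic sketch of flattening the sampling distribution at digit positions with a temperature above 1, so that a memorized digit is less likely to be reproduced exactly. This is not the paper's sampler: the function names, the ten-logit input, and the use of plain temperature scaling are all assumptions chosen for illustration.

```python
import math
import random


def digit_probabilities(logits, temperature=1.0):
    """Softmax over the ten digit logits; temperature > 1 flattens it."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_digit(logits, temperature, rng):
    """Draw one digit (0-9) from the temperature-scaled distribution."""
    probs = digit_probabilities(logits, temperature)
    return rng.choices(range(10), weights=probs, k=1)[0]
```

With a strongly peaked set of logits, raising the temperature moves probability mass from the favored digit to its alternatives. A real defense would presumably apply such perturbation selectively, for example only to low-order digits, to limit the loss of fidelity.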

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same digit-string leakage may appear in other structured generative tasks that output numeric fields.
  • Synthetic data pipelines for privacy-sensitive domains may need routine checks for digit memorization before release.
  • Future generators could incorporate explicit anti-memorization steps for numeric sequences without major redesign.

Load-bearing premise

The appearance of a record's numeric digit strings in generated data reliably indicates that the record was in the training set, rather than arising from model generalization or coincidental patterns.

What would settle it

Run the LevAtt attack on synthetic data produced from a training set whose numeric digit strings have been deliberately randomized or replaced with non-memorized alternatives; if attack accuracy remains high, the claim that digit strings indicate membership would be falsified.
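
One way to build the control set for this test is to replace every digit in a serialized record with a random digit while leaving separators and field widths untouched. The helper below is a minimal sketch under that assumption; `randomize_digits` is a hypothetical name, and uniform replacement is only one of several possible randomization schemes.

```python
import random


def randomize_digits(text, rng):
    """Replace each ASCII digit with a uniformly random digit; keep all else."""
    digits = "0123456789"
    return "".join(rng.choice(digits) if ch in digits else ch for ch in text)
```

Passing a seeded `random.Random` instance keeps the control set reproducible across runs.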

Figures

Figures reproduced from arXiv: 2512.08875 by Bochao Gu, Chi-Hua Wang, Guang Cheng, Joshua Ward.

Figure 1: Diagram of Levenshtein Attack. We simply encode rows of tabular data into a string representation from which to ...
Figure 2: ROC plot for various No-box MIAs against TabPFN
Figure 3: Correlation plot for No-box MIA AUC-ROC across ...
Figure 4: LevAtt AUC-ROC for various datasets generated ...
Figure 6: Visualization of the transformation function
Figure 7: Effect of the TLP transformation on logit distributions. Before transformation (left), lower logits are tightly concen...
Figure 8: Privacy–fidelity comparison of DM on RealTab...
Figure 9: Privacy–fidelity trade-off of TLP on RealTabFormer synthetic data. We plot the AUC and Maximum Mean Discrepancy ...
Figure 10: Utility comparison of XGBoost models trained on real, vanilla synthetic, and TLP-protected synthetic data at various ...
Original abstract

Large Language Models (LLMs) have recently demonstrated remarkable performance in generating high-quality tabular synthetic data. In practice, two primary approaches have emerged for adapting LLMs to tabular data generation: (i) fine-tuning smaller models directly on tabular datasets, and (ii) prompting larger models with examples provided in context. In this work, we show that popular implementations from both regimes exhibit a tendency to compromise privacy by reproducing memorized patterns of numeric digits from their training data. To systematically analyze this risk, we introduce a simple No-box Membership Inference Attack (MIA) called LevAtt that assumes adversarial access to only the generated synthetic data and targets the string sequences of numeric digits in synthetic observations. Using this approach, our attack exposes substantial privacy leakage across a wide range of models and datasets, and in some cases, is even a perfect membership classifier on state-of-the-art models. Our findings highlight a unique privacy vulnerability of LLM-based synthetic data generation and the need for effective defenses. To this end, we propose two methods, including a novel sampling strategy that strategically perturbs digits during generation. Our evaluation demonstrates that this approach can defeat these attacks with minimal loss of fidelity and utility of the synthetic data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that LLM-based tabular data generators (both fine-tuned small models and prompted large models) leak private information by reproducing exact numeric digit sequences from their training data. It introduces LevAtt, a simple no-box membership inference attack that flags candidate records as training-set members when their digit sequences reappear in the synthetic data. It reports substantial leakage (including perfect classification on some SOTA models) across multiple models and datasets, and it proposes two defenses, one of which is a novel digit-perturbation sampling strategy that preserves fidelity.

Significance. If the empirical results hold after the requested controls, the work identifies a concrete and previously under-examined privacy vector in the rapidly adopted setting of LLM tabular synthesis. The no-box threat model and the demonstration that a trivial string-matching rule can serve as a near-perfect classifier on some models are noteworthy; the proposed perturbation defense is a practical contribution that could be adopted quickly.

major comments (3)
  1. [§4 and §5] §4 (Attack Evaluation) and §5 (Results): the claim of perfect or near-perfect classification on SOTA models is not accompanied by per-column entropy statistics, train/test digit-sequence overlap rates, or false-positive rates measured on held-out non-member records. Without these quantities it is impossible to rule out that the observed leakage is inflated by low-entropy numeric fields whose n-grams occur with non-negligible base rate under the learned marginal distribution.
  2. [§3.2] §3.2 (LevAtt Definition): the attack treats exact reproduction of any numeric digit string as a membership signal. The manuscript should report an ablation that varies the minimum string length and the column-selection criterion (e.g., only columns whose empirical entropy exceeds a threshold) to demonstrate that the reported AUCs are not artifacts of including trivially predictable fields such as IDs or ages.
  3. [§6] §6 (Defense Evaluation): the fidelity/utility numbers for the proposed digit-perturbation sampler are given only in aggregate. A per-column breakdown (or at least for the columns that drove the original attack success) is needed to confirm that the defense does not simply trade one form of leakage for another (e.g., by increasing variance in high-entropy columns).
minor comments (2)
  1. [Tables/Figures] Table 1 and Figure 2 captions should explicitly state the number of runs and whether error bars represent standard deviation or standard error.
  2. [§3] The notation for the membership label and the LevAtt decision rule should be introduced once in §3 and used consistently thereafter; currently the same symbol appears with slightly different meanings in the attack pseudocode and the experimental tables.
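
Two of the statistics requested in major comment 1 are simple to compute. The sketch below shows one plausible reading of them, an empirical Shannon entropy per column and a false-positive rate on held-out non-members at a fixed score threshold. The function names and the `>=` decision rule are assumptions, not taken from the manuscript.

```python
import math
from collections import Counter


def column_entropy(values):
    """Empirical Shannon entropy, in bits, of a column's entries."""
    counts = Counter(str(v) for v in values)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def false_positive_rate(non_member_scores, threshold):
    """Share of held-out non-members flagged as members at this threshold."""
    flagged = sum(1 for s in non_member_scores if s >= threshold)
    return flagged / len(non_member_scores)
```

A column whose entropy is near zero carries little membership signal on its own, so matches in such columns are the ones most likely to inflate the reported attack performance.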

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The suggested additions of entropy statistics, ablations, and per-column breakdowns will improve the clarity and robustness of our results. We address each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [§4 and §5] §4 (Attack Evaluation) and §5 (Results): the claim of perfect or near-perfect classification on SOTA models is not accompanied by per-column entropy statistics, train/test digit-sequence overlap rates, or false-positive rates measured on held-out non-member records. Without these quantities it is impossible to rule out that the observed leakage is inflated by low-entropy numeric fields whose n-grams occur with non-negligible base rate under the learned marginal distribution.

    Authors: We agree that these additional statistics are important to rule out confounding factors. In the revision we will add per-column entropy statistics for all numeric fields, train/test digit-sequence overlap rates, and false-positive rates computed on held-out non-member records. These will be reported in the updated §4 and §5 to demonstrate that the leakage is not driven solely by low-entropy columns. revision: yes

  2. Referee: [§3.2] §3.2 (LevAtt Definition): the attack treats exact reproduction of any numeric digit string as a membership signal. The manuscript should report an ablation that varies the minimum string length and the column-selection criterion (e.g., only columns whose empirical entropy exceeds a threshold) to demonstrate that the reported AUCs are not artifacts of including trivially predictable fields such as IDs or ages.

    Authors: We appreciate the request for an ablation study. We will include a new ablation in the revised §3.2 that varies the minimum string length (e.g., 4, 6, and 8 digits) and restricts columns to those exceeding an entropy threshold. The resulting AUCs will be reported to show that LevAtt remains effective even when low-entropy or trivially predictable columns are excluded. revision: yes

  3. Referee: [§6] §6 (Defense Evaluation): the fidelity/utility numbers for the proposed digit-perturbation sampler are given only in aggregate. A per-column breakdown (or at least for the columns that drove the original attack success) is needed to confirm that the defense does not simply trade one form of leakage for another (e.g., by increasing variance in high-entropy columns).

    Authors: We agree that aggregate metrics alone are insufficient. In the revised §6 we will provide a per-column breakdown of fidelity and utility for the digit-perturbation sampler, with emphasis on the columns that contributed most to attack success. This will confirm that the defense does not increase variance or introduce new issues in high-entropy columns. revision: yes
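
The minimum-length ablation promised in response 2 could be prototyped as below: extract runs of digits of at least a given length and count how many of a candidate's runs also occur in the synthetic text. The regular expression, the function names, and the default lengths (taken from the "4, 6, and 8 digits" example in the response) are illustrative assumptions; decimal points and separators split runs in this version.

```python
import re


def digit_strings(text, min_len):
    """Set of maximal ASCII digit runs with at least min_len digits."""
    return set(re.findall(r"[0-9]{%d,}" % min_len, text))


def overlap_by_min_length(candidate, synthetic_texts, lengths=(4, 6, 8)):
    """For each minimum length, count candidate digit runs seen in synthetic data."""
    result = {}
    for n in lengths:
        pool = set().union(*(digit_strings(s, n) for s in synthetic_texts))
        result[n] = len(digit_strings(candidate, n) & pool)
    return result
```

For example, a candidate `"id=1234,income=51234870"` checked against the single synthetic row `"id=1234,income=99999999"` shares one run at length 4 (the short ID) and none at lengths 6 or 8, which is the kind of drop-off the ablation is meant to expose.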

Circularity Check

0 steps flagged

No circularity: empirical attack defined and evaluated directly on outputs

Full rationale

The paper defines LevAtt as a simple string-matching MIA on numeric digit sequences in LLM-generated tabular rows, then measures its success against explicit held-out membership labels across models and datasets. No equations, fitted parameters, or self-citations are used to derive the attack or its performance; success rates are reported as direct experimental outcomes. The central claims rest on falsifiable empirical results rather than any reduction to inputs by construction, self-definition, or load-bearing self-citation chains.
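
Measuring attack success against held-out membership labels amounts to computing an AUC-ROC over attack scores. A minimal, dependency-free sketch using the pairwise (Mann-Whitney) formulation follows; the function name is hypothetical, and the quadratic loop is chosen for clarity rather than speed.

```python
def auc_roc(member_scores, non_member_scores):
    """Probability that a random member outscores a random non-member (ties count half)."""
    wins = 0.0
    for m in member_scores:
        for n in non_member_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(non_member_scores))
```

A value of 1.0 corresponds to the "perfect membership classifier" case mentioned in the abstract, and 0.5 corresponds to chance.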

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical observation that LLMs memorize numeric digit sequences in tabular data. No free parameters are introduced; the attack uses direct string matching. Axioms are standard assumptions about LLM memorization behavior.

axioms (1)
  • Domain assumption: LLMs trained or prompted on tabular data can reproduce exact numeric digit sequences from training examples in their outputs.
    Invoked throughout the abstract as the basis for the leakage and attack success.


