When Tables Leak: Attacking String Memorization in LLM-Based Tabular Data Generation

Bochao Gu; Chi-Hua Wang; Guang Cheng; Joshua Ward

arxiv: 2512.08875 · v2 · submitted 2025-12-09 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-16 23:44 UTC · model grok-4.3

keywords: membership inference attack · tabular data generation · LLM memorization · synthetic data privacy · digit string leakage · no-box attack · privacy leakage · data utility

The pith

LLM tabular data generators leak training records through memorized numeric digit strings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models used for synthetic tabular data generation, whether fine-tuned or prompted in context, frequently reproduce specific numeric digit sequences from their training examples. This reproduction enables a no-box membership inference attack called LevAtt that examines only the generated synthetic tables to determine whether particular records appeared in training. The attack demonstrates substantial privacy leakage across multiple models and datasets, reaching perfect classification accuracy in some cases on current state-of-the-art generators. The paper also introduces mitigation approaches, including a sampling method that perturbs digits during generation, which reduces the leakage while preserving most of the synthetic data's fidelity and utility.

Core claim

Popular LLM adaptations for tabular data generation memorize and reproduce string sequences of numeric digits drawn from training observations. This memorization allows a simple attack with access solely to the synthetic outputs to infer training-set membership by matching those digit strings, exposing privacy leakage that can reach perfect accuracy on certain models and datasets.

What carries the argument

LevAtt, a no-box membership inference attack that targets memorized string sequences of numeric digits in synthetic observations to classify training-set membership.
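A minimal sketch of how a Levenshtein-based no-box attack of this kind might score candidate records, assuming rows are serialized as comma-joined strings and the score is the negated minimum edit distance to any synthetic row. The names `encode_row`, `levenshtein`, and `membership_score`, the serialization format, and the toy values are illustrative assumptions, not the paper's implementation.

```python
from typing import Sequence


def encode_row(row: Sequence[str]) -> str:
    """Serialize a table row as a comma-joined string (assumed format)."""
    return ",".join(row)


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def membership_score(candidate: Sequence[str],
                     synthetic_rows: Sequence[Sequence[str]]) -> int:
    """Higher score = closer to some synthetic row = more likely a member."""
    target = encode_row(candidate)
    return -min(levenshtein(target, encode_row(s)) for s in synthetic_rows)


# Toy data: values are kept as strings so their digits are compared verbatim.
synthetic = [("52.4173", "1889.55"), ("17.0021", "402.10")]
member = ("52.4173", "1889.50")      # shares long digit runs with a synthetic row
non_member = ("48.9310", "2210.07")

print(membership_score(member, synthetic), membership_score(non_member, synthetic))
```

Thresholding this score yields a membership decision; sweeping the threshold yields a ROC curve.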

If this is right

Both fine-tuning and in-context prompting regimes for LLM tabular generation exhibit the leakage.
The attack requires no model weights or training data access, only the synthetic outputs.
A digit-perturbation sampling strategy during generation defeats the attack while keeping fidelity and utility losses small.
The vulnerability applies across a wide range of models and tabular datasets.
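To illustrate why perturbing digits at sampling time can weaken the signal, here is a toy sketch that flattens the distribution over digit tokens with a temperature. This is a deliberately simple stand-in, not the paper's sampling strategy; the logits, the `sample_digit` and `memorized_rate` helpers, and the choice of temperature scaling are all assumptions made for illustration.

```python
import math
import random
from typing import Mapping

DIGITS = "0123456789"


def sample_digit(logits: Mapping[str, float], temperature: float,
                 rng: random.Random) -> str:
    """Sample one digit after dividing its logits by `temperature`."""
    scaled = [logits[d] / temperature for d in DIGITS]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(DIGITS, weights=weights, k=1)[0]


def memorized_rate(temperature: float, n: int = 2000, seed: int = 0) -> float:
    """Fraction of samples that reproduce the (hypothetical) memorized digit."""
    rng = random.Random(seed)
    # Toy logits: the model strongly prefers the digit "7" it saw in training.
    logits = {d: 0.0 for d in DIGITS}
    logits["7"] = 10.0
    hits = sum(sample_digit(logits, temperature, rng) == "7" for _ in range(n))
    return hits / n


print(memorized_rate(1.0), memorized_rate(5.0))
```

A higher temperature reproduces the memorized digit less often, at the cost of noisier numeric values, which is the privacy and fidelity trade-off any such defense has to manage.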

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same digit-string leakage may appear in other structured generative tasks that output numeric fields.
Synthetic data pipelines for privacy-sensitive domains may need routine checks for digit memorization before release.
Future generators could incorporate explicit anti-memorization steps for numeric sequences without major redesign.

Load-bearing premise

The appearance of particular numeric digit strings in generated records reliably indicates that those records were in the training set rather than arising from model generalization or coincidental patterns.

What would settle it

Run LevAtt against synthetic data produced from a training set whose numeric digit strings were deliberately randomized or replaced with non-memorized alternatives before generation. If attack accuracy remains high, the signal cannot be coming from memorized digit strings, and the claim that digit strings indicate membership would be falsified.
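One way the randomized control described above might be prepared is sketched below. The helper `randomize_digits` is hypothetical: it replaces each digit with a uniformly random one while leaving separators and decimal points in place, so the row format stays valid.

```python
import random

DIGITS = "0123456789"


def randomize_digits(value: str, rng: random.Random) -> str:
    """Replace every ASCII digit with a random digit; keep all other characters."""
    return "".join(rng.choice(DIGITS) if ch in DIGITS else ch for ch in value)


rng = random.Random(0)  # fixed seed so the control set is reproducible
training_rows = ["52.4173,1889.55", "17.0021,402.10"]
control_rows = [randomize_digits(row, rng) for row in training_rows]

print(control_rows)
```

Note that uniform replacement also changes the column distributions, so a generator trained on the control set is not directly comparable on fidelity; the control only isolates the digit-string signal.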

Figures

Figures reproduced from arXiv: 2512.08875 by Bochao Gu, Chi-Hua Wang, Guang Cheng, Joshua Ward.

**Figure 1.** Diagram of Levenshtein Attack. We simply encode rows of tabular data into a string representation from which to ...

**Figure 2.** ROC plot for various No-box MIAs against TabPFN

**Figure 3.** Correlation plot for No-box MIA AUC-ROC across ...

**Figure 4.** LevAtt AUC-ROC for various datasets generated ...

**Figure 6.** Visualization of the transformation function

**Figure 7.** Effect of the TLP transformation on logit distributions. Before transformation (left), lower logits are tightly concen...

**Figure 8.** Privacy–fidelity comparison of DM on RealTab...

**Figure 9.** Privacy–fidelity trade-off of TLP on RealTabFormer synthetic data. We plot the AUC and Maximum Mean Discrepancy ...

**Figure 10.** Utility comparison of XGBoost models trained on real, vanilla synthetic, and TLP-protected synthetic data at various ...

Original abstract

Large Language Models (LLMs) have recently demonstrated remarkable performance in generating high-quality tabular synthetic data. In practice, two primary approaches have emerged for adapting LLMs to tabular data generation: (i) fine-tuning smaller models directly on tabular datasets, and (ii) prompting larger models with examples provided in context. In this work, we show that popular implementations from both regimes exhibit a tendency to compromise privacy by reproducing memorized patterns of numeric digits from their training data. To systematically analyze this risk, we introduce a simple No-box Membership Inference Attack (MIA) called LevAtt that assumes adversarial access to only the generated synthetic data and targets the string sequences of numeric digits in synthetic observations. Using this approach, our attack exposes substantial privacy leakage across a wide range of models and datasets, and in some cases, is even a perfect membership classifier on state-of-the-art models. Our findings highlight a unique privacy vulnerability of LLM-based synthetic data generation and the need for effective defenses. To this end, we propose two methods, including a novel sampling strategy that strategically perturbs digits during generation. Our evaluation demonstrates that this approach can defeat these attacks with minimal loss of fidelity and utility of the synthetic data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

LevAtt shows that numeric digit strings in LLM tabular outputs can leak membership with high accuracy in a no-box setting, but the results may partly reflect low-entropy column patterns rather than pure memorization.

The main thing to know is that this paper demonstrates a straightforward no-box membership inference attack called LevAtt that flags training data by looking for exact reproductions of numeric digit sequences in synthetic tabular rows. It works on both fine-tuned smaller models and prompted larger ones, and in some cases reaches perfect classification on the datasets they tested. They also test two defenses, including a sampling tweak that perturbs digits during generation to cut the leakage while trying to preserve utility and fidelity.

Referee Report

3 major / 2 minor

Summary. The paper claims that LLM-based tabular data generators (both fine-tuned small models and prompted large models) leak privacy by reproducing exact numeric digit sequences from their training data. It introduces LevAtt, a simple no-box membership inference attack that flags synthetic rows containing such sequences as training-set members, reports substantial leakage (including perfect classification on some SOTA models) across multiple models and datasets, and proposes two defenses, one of which is a novel digit-perturbation sampling strategy that preserves fidelity.

Significance. If the empirical results hold after the requested controls, the work identifies a concrete and previously under-examined privacy vector in the rapidly adopted setting of LLM tabular synthesis. The no-box threat model and the demonstration that a trivial string-matching rule can serve as a near-perfect classifier on some models are noteworthy; the proposed perturbation defense is a practical contribution that could be adopted quickly.

major comments (3)

[§4 and §5] §4 (Attack Evaluation) and §5 (Results): the claim of perfect or near-perfect classification on SOTA models is not accompanied by per-column entropy statistics, train/test digit-sequence overlap rates, or false-positive rates measured on held-out non-member records. Without these quantities it is impossible to rule out that the observed leakage is inflated by low-entropy numeric fields whose n-grams occur with non-negligible base rate under the learned marginal distribution.
[§3.2] §3.2 (LevAtt Definition): the attack treats exact reproduction of any numeric digit string as a membership signal. The manuscript should report an ablation that varies the minimum string length and the column-selection criterion (e.g., only columns whose empirical entropy exceeds a threshold) to demonstrate that the reported AUCs are not artifacts of including trivially predictable fields such as IDs or ages.
[§6] §6 (Defense Evaluation): the fidelity/utility numbers for the proposed digit-perturbation sampler are given only in aggregate. A per-column breakdown (or at least for the columns that drove the original attack success) is needed to confirm that the defense does not simply trade one form of leakage for another (e.g., by increasing variance in high-entropy columns).
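The controls requested in the first two comments can be computed with very little code. The sketch below assumes column values are available as strings and that attack scores follow a "higher means more likely member" convention; `column_entropy` and `false_positive_rate` are hypothetical helper names, and the toy values are made up.

```python
import math
from collections import Counter
from typing import Sequence


def column_entropy(values: Sequence[str]) -> float:
    """Empirical Shannon entropy (in bits) of a column's observed values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def false_positive_rate(non_member_scores: Sequence[float],
                        threshold: float) -> float:
    """Share of held-out non-members the attack would flag as members."""
    flagged = sum(score >= threshold for score in non_member_scores)
    return flagged / len(non_member_scores)


ages = ["34", "34", "51", "51"]                      # low-entropy column
balances = ["1889.55", "402.10", "77.31", "9120.04"]  # high-entropy column

print(column_entropy(ages), column_entropy(balances))
print(false_positive_rate([-1, -5, -7, -2], threshold=-2))
```

A column with entropy near zero will produce matching digit strings for members and non-members alike, which is the confound the referee is asking the authors to rule out.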

minor comments (2)

[Tables/Figures] Table 1 and Figure 2 captions should explicitly state the number of runs and whether error bars represent standard deviation or standard error.
[§3] The notation for the membership label and the LevAtt decision rule should be introduced once in §3 and used consistently thereafter; currently the same symbol appears with slightly different meanings in the attack pseudocode and the experimental tables.
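For the first minor comment, the distinction between the two kinds of error bar is small in code but large in a plot. A short sketch, assuming per-run AUC values are collected in a list (the numbers are made up and `summarize_runs` is a hypothetical helper):

```python
import math
import statistics
from typing import Sequence, Tuple


def summarize_runs(values: Sequence[float]) -> Tuple[int, float, float, float]:
    """Return (number of runs, mean, sample standard deviation, standard error)."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)   # sample standard deviation (n - 1 denominator)
    se = sd / math.sqrt(n)          # standard error of the mean
    return n, mean, sd, se


aucs = [0.90, 0.92, 0.94]  # illustrative per-run AUC-ROC values
print(summarize_runs(aucs))
```

The standard error shrinks with the number of runs while the standard deviation does not, so a caption that omits both the run count and the bar type cannot be interpreted.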

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The suggested additions of entropy statistics, ablations, and per-column breakdowns will improve the clarity and robustness of our results. We address each major comment below and will revise the manuscript accordingly.

Referee: [§4 and §5] §4 (Attack Evaluation) and §5 (Results): the claim of perfect or near-perfect classification on SOTA models is not accompanied by per-column entropy statistics, train/test digit-sequence overlap rates, or false-positive rates measured on held-out non-member records. Without these quantities it is impossible to rule out that the observed leakage is inflated by low-entropy numeric fields whose n-grams occur with non-negligible base rate under the learned marginal distribution.

Authors: We agree that these additional statistics are important to rule out confounding factors. In the revision we will add per-column entropy statistics for all numeric fields, train/test digit-sequence overlap rates, and false-positive rates computed on held-out non-member records. These will be reported in the updated §4 and §5 to demonstrate that the leakage is not driven solely by low-entropy columns.

Revision: yes

Referee: [§3.2] §3.2 (LevAtt Definition): the attack treats exact reproduction of any numeric digit string as a membership signal. The manuscript should report an ablation that varies the minimum string length and the column-selection criterion (e.g., only columns whose empirical entropy exceeds a threshold) to demonstrate that the reported AUCs are not artifacts of including trivially predictable fields such as IDs or ages.

Authors: We appreciate the request for an ablation study. We will include a new ablation in the revised §3.2 that varies the minimum string length (e.g., 4, 6, and 8 digits) and restricts columns to those exceeding an entropy threshold. The resulting AUCs will be reported to show that LevAtt remains effective even when low-entropy or trivially predictable columns are excluded.

Revision: yes

Referee: [§6] §6 (Defense Evaluation): the fidelity/utility numbers for the proposed digit-perturbation sampler are given only in aggregate. A per-column breakdown (or at least for the columns that drove the original attack success) is needed to confirm that the defense does not simply trade one form of leakage for another (e.g., by increasing variance in high-entropy columns).

Authors: We agree that aggregate metrics alone are insufficient. In the revised §6 we will provide a per-column breakdown of fidelity and utility for the digit-perturbation sampler, with emphasis on the columns that contributed most to attack success. This will confirm that the defense does not increase variance or introduce new issues in high-entropy columns.

Revision: yes
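The minimum-string-length ablation discussed in the second exchange could be prototyped roughly as follows. This is a sketch under assumptions: `digit_runs` and `shared_runs` are hypothetical helpers that treat a "digit string" as a maximal run of consecutive digits, which may differ from how the paper defines it.

```python
import re
from typing import Iterable, Set


def digit_runs(text: str, min_len: int) -> Set[str]:
    """Maximal runs of consecutive ASCII digits with at least `min_len` digits."""
    return set(re.findall(r"[0-9]{%d,}" % min_len, text))


def shared_runs(candidate: str, synthetic_rows: Iterable[str],
                min_len: int) -> Set[str]:
    """Digit runs of the candidate that also appear in any synthetic row."""
    synthetic: Set[str] = set()
    for row in synthetic_rows:
        synthetic |= digit_runs(row, min_len)
    return digit_runs(candidate, min_len) & synthetic


candidate = "52.4173,1889.50"
synthetic_rows = ["52.4173,1889.55", "17.0021,402.10"]

for min_len in (2, 4, 6):
    print(min_len, sorted(shared_runs(candidate, synthetic_rows, min_len)))
```

Short runs match easily by chance, so the interesting question is how quickly attack performance degrades as `min_len` grows.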

Circularity Check

0 steps flagged

No circularity: empirical attack defined and evaluated directly on outputs

The paper defines LevAtt as a simple string-matching MIA on numeric digit sequences in LLM-generated tabular rows, then measures its success against explicit held-out membership labels across models and datasets. No equations, fitted parameters, or self-citations are used to derive the attack or its performance; success rates are reported as direct experimental outcomes. The central claims rest on falsifiable empirical results rather than any reduction to inputs by construction, self-definition, or load-bearing self-citation chains.
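Measuring an attack against held-out membership labels usually comes down to an AUC-ROC computation. A minimal sketch using the pairwise (rank-based) definition, where `auc_roc` is an illustrative helper and the score convention (higher means more likely member) is an assumption:

```python
from typing import Sequence


def auc_roc(member_scores: Sequence[float],
            non_member_scores: Sequence[float]) -> float:
    """Probability that a random member outscores a random non-member.

    Ties count as half a win, matching the usual rank-based definition.
    """
    wins = 0.0
    for m in member_scores:
        for n in non_member_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(non_member_scores))


# Toy scores: negated edit distances, so values nearer zero mean "closer".
print(auc_roc([-1, -2], [-5, -7]))
print(auc_roc([-4, -6], [-5, -7]))
```

An AUC of 1.0 corresponds to the "perfect membership classifier" case reported for some models, and 0.5 corresponds to chance.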

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical observation that LLMs memorize numeric digit sequences in tabular data. No free parameters are introduced; the attack uses direct string matching. Axioms are standard assumptions about LLM memorization behavior.

axioms (1)

Domain assumption: LLMs trained or prompted on tabular data can reproduce exact numeric digit sequences from training examples in their outputs.
Invoked throughout the abstract as the basis for the leakage and attack success.

pith-pipeline@v0.9.0 · 5516 in / 1292 out tokens · 33380 ms · 2026-05-16T23:44:03.591084+00:00 · methodology

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 5 internal anchors

[1]

Rakesh Agrawal and Jerry Kiernan. 2002. Watermarking relational databases. In VLDB’02: Proceedings of the 28th International Conference on Very Large Databases. Morgan Kaufmann, Hong Kong, China, 155–166

work page 2002
[2]

Abd S Alfagi, A Abd Manaf, B Hamida, S Khan, and Ali A Elrowayati. 2016. Survey on relational database watermarking techniques.ARPN-JEAS11 (2016), 422–423

work page 2016
[3]

Ankur Ankan and Abinash Panda. 2015. pgmpy: Probabilistic Graphical Models using Python. InProceedings of the Python in Science Conference (SciPy). SciPy, Austin, TX, USA, 6–11. https://doi.org/10.25080/majora-7b98e3ed-001

work page doi:10.25080/majora-7b98e3ed-001 2015
[4]

and Dervovic, Danial and Mahfouz, Mahmoud and Tillman, Robert E

Samuel A. Assefa, Danial Dervovic, Mahmoud Mahfouz, Robert E. Tillman, Prashant Reddy, and Manuela Veloso. 2021. Generating synthetic data in finance: opportunities, challenges and pitfalls. InProceedings of the First ACM International Conference on AI in Finance(New York, New York)(ICAIF ’20). Association for Computing Machinery, New York, NY, USA, Artic...

work page doi:10.1145/3383455.3422554 2021
[5]

Vadim Borisov, Kathrin Seßler, Tobias Leemann, Martin Pawelczyk, and Gjergji Kasneci. 2023. Language Models are Realistic Tabular Data Generators. arXiv:2210.06280 [cs.LG] https://arxiv.org/abs/2210.06280

work page arXiv 2023
[6]

Jessup Byun, Xiaofeng Lin, Joshua Ward, and Guang Cheng. 2025. Risk In Context: Benchmarking Privacy Leakage of Foundation Models in Synthetic Tabular Data Generation. arXiv:2507.17066 [cs.LG] https://arxiv.org/abs/2507.17066

work page arXiv 2025
[7]

Terzis, and Florian Tramèr

Nicholas Carlini, Steve Chien, Milad Nasr, Shuang Song, A. Terzis, and Florian Tramèr. 2021. Membership Inference Attacks From First Principles. , 1897- 1914 pages. https://api.semanticscholar.org/CorpusID:244920593

work page 2021
[8]

Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. 2023. Quantifying Memorization Across Neural Language Models. arXiv:2202.07646 [cs.LG] https://arxiv.org/abs/2202.07646

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert- Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, and Colin Raffel. 2021. Extracting Training Data from Large Lan- guage Models. arXiv:2012.07805 [cs.CR] https://arxiv.org/abs/2012.07805

work page arXiv 2021
[10]

Dingfan Chen, Ning Yu, Yang Zhang, and Mario Fritz. 2020. GAN-Leaks: A Taxonomy of Membership Inference Attacks against Generative Models. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security (CCS ’20). ACM, Virtual Event, USA, 343–362. https://doi.org/10.1145/ 3372297.3417238

work page arXiv 2020
[11]

Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma. 2025. SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training. InForty- second International Conference on Machine Learning, Vol. TBD. PMLR, Vancouver, Canada, XXXX–YYYY. https://openreview.net/forum?id=dYur3yabMj

work page 2025
[12]

Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios. 2019. Neural spline flows. InAdvances in Neural Information Processing Systems, Vol. 32. Curran Associates Inc., Vancouver, Canada, 7627–7638

work page 2019
[13]

Sebastian Felix Fischer, Matthias Feurer, and Bernd Bischl. 2023. OpenML-CTR23 – A curated tabular regression benchmarking suite. InAutoML Conference 2023 (Workshop). PMLR, Baltimore, MD, USA. https://openreview.net/forum?id= HebAOoMm94

work page 2023
[14]

Brendan Flanagan, Rwitajit Majumdar, and Hiroaki Ogata. 2022. Fine Grain Synthetic Educational Data: Challenges and Limitations of Collaborative Learn- ing Analytics.IEEE Access10 (03 2022), 26230–26241. https://doi.org/10.1109/ ACCESS.2022.3156073

work page arXiv 2022
[15]

Joao Fonseca and Fernando Bação. 2023. Tabular and latent space synthetic data generation: a literature review.Journal of Big Data10 (07 2023). https: //doi.org/10.1186/s40537-023-00792-7

work page doi:10.1186/s40537-023-00792-7 2023
[16]

Wenjie Fu, Huandong Wang, Chen Gao, Guanghua Liu, Yong Li, and Tao Jiang

work page
[17]

InProceedings of the 38th International Conference on Neural Information Processing Systems(Vancouver, BC, Canada)(NIPS ’24)

Membership inference attacks against fine-tuned large language models via self-prompt calibration. InProceedings of the 38th International Conference on Neural Information Processing Systems(Vancouver, BC, Canada)(NIPS ’24). Curran Associates Inc., Red Hook, NY, USA, Article 4290, 30 pages

work page
[18]

Filippo Galli, Luca Melis, and Tommaso Cucinotta. 2024. Noisy Neighbors: Efficient membership inference attacks against LLMs. InProceedings of the Fifth Workshop on Privacy in Natural Language Processing, Ivan Habernal, Sepideh Ghanavati, Abhilasha Ravichander, Vijayanta Jain, Patricia Thaine, Timour Igamberdiev, Niloofar Mireshghallah, and Oluwaseyi Feyi...

work page 2024
[19]

Mauro Giuffré and Dennis L. Shung. 2023. Harnessing the power of synthetic data in healthcare: innovation, application, and privacy.NPJ Digital Medicine6 (2023). https://api.semanticscholar.org/CorpusID:263802405

work page 2023
[20]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Ab- hishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

Benjamin Hilprecht, Martin Härterich, and Daniel Bernau. 2019. Monte Carlo and Reconstruction Membership Inference Attacks against Generative Models. Proceedings on Privacy Enhancing Technologies2019 (2019), 232 – 249. https: //api.semanticscholar.org/CorpusID:199546273

work page 2019
[22]

Noah Hollmann, Samuel Müller, Lennart Purucker, Arjun Krishnakumar, Max Körfer, Shi Bin Hoo, Robin Tibor Schirrmeister, and Frank Hutter. 2025. Accurate predictions on small data with a tabular foundation model.Nature637, 8045 (2025), 319–326

work page 2025
[23]

Noah Hollmann, Samuel Müller, Lennart Purucker, Arjun Krishnakumar, Max Körfer, Shi Hoo, Robin Schirrmeister, and Frank Hutter. 2025. Accurate pre- dictions on small data with a tabular foundation model.Nature637 (01 2025), 319–326. https://doi.org/10.1038/s41586-024-08328-6

work page doi:10.1038/s41586-024-08328-6 2025
[24]

Florimond Houssiau, James Jordon, Samuel N Cohen, Owen Daniel, Andrew Elliott, James Geddes, Callum Mole, Camila Rangel-Smith, and Lukasz Szpruch

work page
[25]

Tapas: a toolbox for adversarial privacy auditing of synthetic data

work page
[26]

Daphne Ippolito, Florian Tramer, Milad Nasr, Chiyuan Zhang, Matthew Jagiel- ski, Katherine Lee, Christopher Choquette Choo, and Nicholas Carlini. 2023. Preventing Generation of Verbatim Memorization in Language Models Gives a False Sense of Privacy. InProceedings of the 16th International Natural Lan- guage Generation Conference, C. Maria Keet, Hung-Yi Le...

work page doi:10.18653/v1/2023.inlg-main.3 2023
[27]

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, De- vendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. arXiv:2310.068...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[28]

Nikhil Kandpal, Eric Wallace, and Colin Raffel. 2022. Deduplicating Training Data Mitigates Privacy Risks in Language Models.ArXivabs/2202.06539 (2022). https://api.semanticscholar.org/CorpusID:246823128

work page arXiv 2022
[29]

Jinhee Kim, Taesung Kim, and Jaegul Choo. 2024. EPIC: Effective Prompt- ing for Imbalanced-Class Data Synthesis in Tabular Data Classification via Large Language Models. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems. Curran Associates, Inc., Vancouver, Canada. https://openreview.net/forum?id=d5cKDHCrFJ

work page 2024
[30]

Akim Kotelnikov, Dmitry Baranchuk, Ivan Rubachev, and Artem Babenko

work page
[31]

arXiv:2209.15421 [cs.LG]

TabDDPM: Modelling Tabular Data with Diffusion Models. arXiv:2209.15421 [cs.LG]

work page arXiv
[32]

Vladimir I Levenshtein. 1966. Binary Codes Capable of Correcting Deletions, Insertions and Reversals.Soviet Physics Doklady10 (February 1966), 707

work page 1966
[33]

Junwei Ma, Apoorv Dankar, George Stein, Guangwei Yu, and Anthony Caterini. 2024. TabPFGen – Tabular Data Generation with TabPFN. arXiv:2406.05216 [cs.LG] https://arxiv.org/abs/2406.05216

work page arXiv 2024
[34]

Justus Mattern, Fatemehsadat Mireshghallah, Zhijing Jin, Bernhard Schoelkopf, Mrinmaya Sachan, and Taylor Berg-Kirkpatrick. 2023. Membership Inference At- tacks against Language Models via Neighbourhood Comparison. InFindings of the Association for Computational Linguistics: ACL 2023, Anna Rogers, Jordan Boyd- Graber, and Naoaki Okazaki (Eds.). Associatio...

work page doi:10.18653/v1/2023.findings-acl.719 2023
[35]

Ryan McKenna, Brett Mullins, Daniel Sheldon, and Gerome Miklau. 2022. AIM: an adaptive and iterative mechanism for differentially private synthetic data.Proc. VLDB Endow.15, 11 (July 2022), 2599–2612. https://doi.org/10.14778/3551793. 3551817

work page doi:10.14778/3551793 2022
[36]

2024.Achilles’ Heels: Vulnerable Record Identification in Synthetic Data Publishing

Matthieu Meeus, Florent Guepin, Ana-Maria Creţu, and Yves-Alexandre de Montjoye. 2024.Achilles’ Heels: Vulnerable Record Identification in Synthetic Data Publishing. Springer Nature Switzerland, Cham, Switzerland, 380–399. https://doi.org/10.1007/978-3-031-51476-0_19

work page doi:10.1007/978-3-031-51476-0_19 2024
[37]

Meta AI. 2024. LLaMA-3.3 70B Instruct Model. https://huggingface.co/meta- llama/Llama-3.3-70B-Instruct. Released December 6, 2024; accessed 2025-06-13

work page 2024
[38]

Gonzalo Navarro. 2001. A Guided Tour to Approximate String Matching.Comput. Surveys33, 1 (2001), 31–88

work page 2001
[39]

OpenAI. 2024. GPT-4o Mini Model in Chat Completions API. https://platform. openai.com/docs/models/gpt-4o-mini. Released July 18, 2024; accessed 2025-06- 13

work page 2024
[40]

Michael Platzer and Thomas Reutterer. 2021. Holdout-based empirical assessment of mixed-type synthetic data.Frontiers in big Data4 (2021), 679939

work page 2021
[41]

Zhaozhi Qian, Bogdan-Constantin Cebere, and Mihaela van der Schaar. 2023. Synthcity: facilitating innovative use cases of synthetic data in different data modalities. https://doi.org/10.48550/ARXIV.2301.07573

work page doi:10.48550/arxiv.2301.07573 2023
[42]

Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

2019.Language Models are Unsupervised Multitask Learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019.Language Models are Unsupervised Multitask Learners. Technical Report. OpenAI

work page 2019
[44]

Nabeel Seedat, Nicolas Huynh, Boris van Breugel, and Mihaela van der Schaar

work page
[45]

InForty-first International Conference on Machine Learning, Vol

Curated LLM: Synergy of LLMs and Data Curation for tabular augmentation in low-data regimes. InForty-first International Conference on Machine Learning, Vol. 235. PMLR, Vienna, Austria, 44060–44092

work page
[46]

Igor Shilov, Matthieu Meeus, and Yves-Alexandre de Montjoye. 2025. The Mosaic Memory of Large Language Models. arXiv:2405.15523 [cs.CL] https://arxiv.org/ abs/2405.15523

work page arXiv 2025
[47]

Membership inference attacks against machine learning models

R. Shokri, M. Stronati, C. Song, and V. Shmatikov. 2017. Membership Inference Attacks Against Machine Learning Models. In2017 IEEE Symposium on Security and Privacy (SP). IEEE Computer Society, Los Alamitos, CA, USA, 3–18. https: //doi.org/10.1109/SP.2017.41

work page doi:10.1109/sp.2017.41 2017
[48]

Aivin V Solatorio and Olivier Dupriez. 2023. Realtabformer: Generating realistic relational and tabular data using transformers

work page 2023
[49]

Theresa Stadler, Bristena Oprisanu, and Carmela Troncoso. 2022. Synthetic Data – Anonymisation Groundhog Day. In31st USENIX Security Symposium (USENIX Security 22). USENIX Association, Boston, MA, 1451–1468. https: //www.usenix.org/conference/usenixsecurity22/presentation/stadler

work page 2022
[50]

Namjoon Suh, Xiaofeng Lin, Din-Yin Hsieh, Mehrdad Honarkhah, and Guang Cheng. 2023. AutoDiff: combining Auto-encoder and Diffusion model for tabular data synthesizing. https://openreview.net/forum?id=XhxOCXlXSh

work page 2023
[51]

Marshall, Severin Elvatun, Helga M.B

Vibeke Binz Vallevik, Aleksandar Babic, Serena E. Marshall, Severin Elvatun, Helga M.B. Brøgger, Sharmini Alagaratnam, Bjørn Edwin, Narasimha R. Veerara- gavan, Anne Kjersti Befring, and Jan F. Nygård. 2024. Can I trust my fake data – A comprehensive quality assessment framework for synthetic tabular data in healthcare.International Journal of Medical Inf...

work page doi:10.1016/j.ijmedinf.2024.105413 2024
[52]

Boris van Breugel, Hao Sun, Zhaozhi Qian, and Mihaela van der Schaar. 2023. Membership Inference Attacks against Synthetic Data through Overfitting De- tection. arXiv:2302.12580 [cs.LG]

work page arXiv 2023
[53]

Yuxin Wang, Duanyu Feng, Yongfu Dai, Zhengyu Chen, Jimin Huang, Sophia Ananiadou, Qianqian Xie, and Hao Wang. 2025. HARMONIC: harnessing LLMs for tabular data synthesis and privacy protection. InProceedings of the 38th International Conference on Neural Information Processing Systems(Vancouver, BC, Canada)(NIPS ’24). Curran Associates Inc., Red Hook, NY, ...

work page 2025
[54]

Zhepeng Wang, Runxue Bao, Yawen Wu, Jackson Taylor, Cao Xiao, Feng Zheng, Weiwen Jiang, Shangqian Gao, and Yanfu Zhang. 2024. Unlocking Memoriza- tion in Large Language Models with Dynamic Soft Prompting. InProceed- ings of the 2024 Conference on Empirical Methods in Natural Language Pro- cessing, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). ...

work page doi:10.18653/v1/2024.emnlp-main.546 2024
[55]

Joshua Ward, Xiaofeng Lin, , Chi-Hua Wang, and Guang Cheng. 2025. Synth- MIA: A Testbed for Auditing Privacy Leakage in Tabular Data Synthesis. arXiv:2509.18014 [cs.CR] https://arxiv.org/abs/2509.18014

work page arXiv 2025
[56]

Joshua Ward, Chi-Hua Wang, and Guang Cheng. 2024. Data Plagiarism Index: Characterizing the Privacy Risk of Data-Copying in Tabular Generative Models. arXiv:2406.13012 [cs.LG] https://arxiv.org/abs/2406.13012

work page arXiv 2024
[57]

Joshua Ward, Chi-Hua Wang, and Guang Cheng. 2025. Privacy Auditing Syn- thetic Data Release through Local Likelihood Attacks. arXiv:2508.21146 [cs.LG] https://arxiv.org/abs/2508.21146

work page internal anchor Pith review Pith/arXiv arXiv 2025
[58]

Watson, Kristin Blesch, Jan Kapar, and Marvin N

David S. Watson, Kristin Blesch, Jan Kapar, and Marvin N. Wright. 2023. Ad- versarial Random Forests for Density Estimation and Generative Modeling. In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics (Proceedings of Machine Learning Research, Vol. 206), Francisco Ruiz, Jen- nifer Dy, and Jan-Willem van de Meent (...

work page 2023
[59]

Jinhong Wu, Konstantinos Plataniotis, Lucy Liu, Ehsan Amjadian, and Yuri Lawryshyn. 2023. Interpretation for Variational Autoencoder Used to Generate Financial Synthetic Tabular Data.Algorithms16 (02 2023), 121. https://doi.org/ 10.3390/a16020121

work page doi:10.3390/a16020121 2023
[60]

Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, and Kalyan Veeramacha- neni. 2019. Modeling Tabular data using Conditional GAN. InAdvances in Neural Information Processing Systems, Vol. 32. Curran Associates, Inc., Van- couver, Canada, 7335–7345. https://proceedings.neurips.cc/paper/2019/hash/ 254ed7d2de3b23ab10936522dd547b78-Abstract.html

work page 2019
[61]

Jinsung Yoon, Lydia N Drumright, and Mihaela Van Der Schaar. 2020. Anonymiza- tion through data synthesis using generative adversarial networks (ads-gan). IEEE journal of biomedical and health informatics24, 8 (2020), 2378–2388

work page 2020
[62]

Jinsung Yoon, James Jordon, and Mihaela van der Schaar. 2019. PATE-GAN: Generating Synthetic Data with Differential Privacy Guarantees. InInternational Conference on Learning Representations. OpenReview.net, New Orleans, LA, USA, 1–15. https://openreview.net/forum?id=S1zk9iRqF7

[63]

Li Yujian and Liu Bo. 2007. A Normalized Levenshtein Distance Metric. IEEE Trans. Pattern Anal. Mach. Intell. 29, 6 (June 2007), 1091–1095. https://doi.org/10.1109/TPAMI.2007.1078

[64]

Hengrui Zhang, Jiani Zhang, Zhengyuan Shen, Balasubramaniam Srinivasan, Xiao Qin, Christos Faloutsos, Huzefa Rangwala, and George Karypis. 2024. Mixed-Type Tabular Data Synthesis with Score-based Diffusion in Latent Space. In The Twelfth International Conference on Learning Representations. OpenReview.net, Vienna, Austria, 4Ay23yeuz0. https://openreview.n...

