Privacy Auditing Synthetic Data Release through Local Likelihood Attacks

Chi-Hua Wang; Guang Cheng; Joshua Ward

arxiv: 2508.21146 · v2 · submitted 2025-08-28 · 💻 cs.LG · stat.ML

Pith reviewed 2026-05-18 20:14 UTC · model grok-4.3

keywords: privacy auditing, synthetic data, membership inference, generative models, tabular data, local overfitting, likelihood ratio, no-box attack

The pith

A no-box membership inference attack uses local overfitting in tabular generative models to audit privacy leakage in synthetic data releases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Generative Likelihood Ratio Attack (Gen-LRA), a computationally efficient membership inference method that requires no model access or knowledge. It works by measuring the influence of a test observation on a surrogate model's estimate of a local likelihood ratio computed over the released synthetic data. The authors derive a closed-form characterization of the attack score as a localized density-ratio statistic and prove that, under a general model of local overfitting, it produces a detectable mean-score gap between training-set members and non-members. This framework supplies both a practical auditing tool and concrete predictions for when the attack succeeds.
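The mechanism described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the surrogate is a fixed-bandwidth Gaussian kernel density estimator, that the attacker holds a reference sample from the population, and that "local" means the `k` synthetic points nearest the test observation. The function names, `k`, and `bandwidth` are all hypothetical.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def gaussian_kde_logpdf(point, data, bandwidth):
    """Log density of `point` under an isotropic Gaussian KDE fit on `data`."""
    d = len(point)
    log_norm = -0.5 * d * math.log(2 * math.pi * bandwidth ** 2) - math.log(len(data))
    exponents = [-sq_dist(point, row) / (2 * bandwidth ** 2) for row in data]
    m = max(exponents)  # log-sum-exp for numerical stability
    return log_norm + m + math.log(sum(math.exp(e - m) for e in exponents))

def gen_lra_style_score(x_test, synthetic, reference, k=3, bandwidth=0.5):
    """Influence of x_test on the surrogate's likelihood of nearby synthetic points.

    Assumption: the score is the log-likelihood ratio of the k synthetic points
    nearest x_test under a surrogate fit on reference + [x_test] versus a
    surrogate fit on reference alone.
    """
    local = sorted(synthetic, key=lambda s: sq_dist(s, x_test))[:k]
    augmented = list(reference) + [x_test]
    return sum(
        gaussian_kde_logpdf(s, augmented, bandwidth)
        - gaussian_kde_logpdf(s, reference, bandwidth)
        for s in local
    )

# Toy data: the synthetic set has a tight cluster near (3, 3) that the
# reference sample does not explain.
reference = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
synthetic = [(3.0, 3.0), (3.1, 3.0), (3.0, 3.1), (0.5, 0.5)]

member_like_score = gen_lra_style_score((3.0, 3.05), synthetic, reference)
far_score = gen_lra_style_score((-3.0, -3.0), synthetic, reference)
print(member_like_score, far_score)
```

In this toy setup a test point sitting inside the unexplained synthetic cluster receives a large positive score, while a point far from every synthetic sample receives a slightly negative one. A larger score is read as stronger evidence of membership.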

Core claim

The Gen-LRA score admits a closed-form characterization as a localized density-ratio statistic. Under a general model of local overfitting, the attack produces a provable mean-score gap between members and non-members of the training data, which yields testable predictions for attack success across datasets, model architectures, and parameters.
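One way to see why a score of this kind reduces to a localized density-ratio statistic is to work it through for a specific surrogate. The sketch below assumes a fixed-bandwidth kernel density estimator fit on a reference sample $R = \{r_1, \dots, r_n\}$. The kernel $K_h$, the reference sample, and the local synthetic neighborhood $N_k(x)$ are illustrative assumptions rather than the paper's notation, and the paper's own closed form may differ.

```latex
\hat p_R(s) = \frac{1}{n}\sum_{i=1}^{n} K_h(s - r_i),
\qquad
\hat p_{R\cup\{x\}}(s) = \frac{n\,\hat p_R(s) + K_h(s - x)}{n+1}

\Lambda(x)
  = \sum_{s \in N_k(x)} \log\frac{\hat p_{R\cup\{x\}}(s)}{\hat p_R(s)}
  = \sum_{s \in N_k(x)} \log\!\left(\frac{n}{n+1}
      + \frac{1}{n+1}\,\frac{K_h(s - x)}{\hat p_R(s)}\right)
```

Under these assumptions the score depends on $x$ only through the ratio $K_h(s - x)/\hat p_R(s)$ at nearby synthetic points. It is large when synthetic mass sits close to $x$ in a region the reference density treats as sparse, which is the signature local overfitting would be expected to leave.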

What carries the argument

The Gen-LRA score, which quantifies the influence of a test point on a surrogate model's local likelihood-ratio estimate over the synthetic data.

If this is right

Gen-LRA consistently outperforms competing membership inference attacks across diverse tabular datasets, generative architectures, and attack parameters, especially at low false-positive rates.
The attack requires no model knowledge or access, so it applies even when only the synthetic dataset itself is released (the no-box setting).
The theoretical framework supplies predictions for attack success that can be validated by measuring the degree of local overfitting in a given generator.
Real-world synthetic data releases carry measurable privacy risks when generative models exhibit regional overfitting.
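Performance at low false-positive rates, mentioned in the first point above, is usually reported as the true-positive rate at a fixed false-positive budget. The helper below is a generic sketch of that metric, not code from the paper. The name `tpr_at_fpr` and the example scores are made up for illustration.

```python
def tpr_at_fpr(member_scores, nonmember_scores, target_fpr=0.01):
    """True-positive rate when at most `target_fpr` of non-members are flagged.

    A point is flagged as a member when its score is strictly above the threshold.
    """
    ranked = sorted(nonmember_scores, reverse=True)
    allowed = int(target_fpr * len(ranked))  # false positives we may tolerate
    # Using the (allowed + 1)-th largest non-member score as the threshold means
    # at most `allowed` non-members can lie strictly above it.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    hits = sum(1 for s in member_scores if s > threshold)
    return hits / len(member_scores)

# Made-up attack scores, higher meaning "more likely a member".
members = [0.9, 0.8, 0.7, 0.2]
nonmembers = [0.95, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.02, 0.01]

tpr_10 = tpr_at_fpr(members, nonmembers, target_fpr=0.1)
tpr_0 = tpr_at_fpr(members, nonmembers, target_fpr=0.0)
print(tpr_10, tpr_0)
```

With these toy numbers, allowing one false positive out of ten non-members flags three of the four members, while allowing none flags no members at all. An attack can therefore look strong on AUC and still be weak in the low-FPR regime that matters for auditing.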

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If local overfitting is the main source of leakage, then adding region-specific regularization to generators could shrink the mean-score gap and lower attack success.
The localized density-ratio construction might extend to auditing privacy leakage in non-tabular data modalities where generative models still overfit locally.
Testing Gen-LRA on generators that explicitly penalize regional memorization would provide a direct check of whether the predicted score gap disappears.

Load-bearing premise

Tabular generative models tend to significantly overfit to certain regions of the training distribution.

What would settle it

A controlled experiment in which a generative model is regularized to remove local overfitting regions, after which the mean Gen-LRA score gap between members and non-members vanishes and the attack performs no better than random guessing.
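A toy version of such an experiment can be simulated directly. The sketch below is an illustration under strong assumptions, not the paper's simulation study: data are uniform on the unit square, the "generator" copies a training point plus small noise with probability `alpha` and otherwise draws a fresh sample, and the score is the fixed-bandwidth KDE likelihood-ratio form with a shared constant dropped. All names and parameter values are hypothetical.

```python
import math
import random

def sq_dist(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def sample_uniform(rng, n):
    return [(rng.random(), rng.random()) for _ in range(n)]

def generate(rng, train, n_syn, alpha, noise):
    """Toy generator: with probability alpha, emit a noisy copy of a training point."""
    out = []
    for _ in range(n_syn):
        if rng.random() < alpha:
            t = rng.choice(train)
            out.append((t[0] + rng.gauss(0, noise), t[1] + rng.gauss(0, noise)))
        else:
            out.append((rng.random(), rng.random()))
    return out

def score(x, synthetic, reference, k, h):
    """Sum over the k synthetic points nearest x of log(1 + K(s-x) / sum_i K(s-r_i)).

    For a Gaussian KDE with bandwidth h this equals the local log-likelihood
    ratio up to the constant k * log(n / (n + 1)), which cancels in any gap.
    """
    local = sorted(synthetic, key=lambda s: sq_dist(s, x))[:k]
    total = 0.0
    for s in local:
        k_x = math.exp(-sq_dist(s, x) / (2 * h * h))
        k_ref = sum(math.exp(-sq_dist(s, r) / (2 * h * h)) for r in reference)
        total += math.log1p(k_x / (k_ref + 1e-300))  # guard against underflow
    return total

def mean_gap(alpha, seed, n=100, n_syn=500, n_ref=1000, k=3, h=0.02, noise=0.002):
    """Mean member score minus mean non-member score for one simulated release."""
    rng = random.Random(seed)
    members = sample_uniform(rng, n)
    nonmembers = sample_uniform(rng, n)
    reference = sample_uniform(rng, n_ref)
    synthetic = generate(rng, members, n_syn, alpha, noise)
    m = sum(score(x, synthetic, reference, k, h) for x in members) / n
    u = sum(score(x, synthetic, reference, k, h) for x in nonmembers) / n
    return m - u

gap_overfit = mean_gap(alpha=0.9, seed=0)  # generator memorizes locally
gap_clean = mean_gap(alpha=0.0, seed=0)    # generator samples the true distribution
print(gap_overfit, gap_clean)
```

In this setup, members and non-members are exchangeable when `alpha` is zero, so the expected gap is zero and whatever remains is sampling noise. Raising `alpha` concentrates synthetic mass around training points and the gap opens up. A real audit would replace the toy generator with the model under test.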

Figures

Figures reproduced from arXiv: 2508.21146 by Chi-Hua Wang, Guang Cheng, Joshua Ward.

**Figure 2.** Average Wasserstein Distance and Average Maximum Mean Discrepancy plotted against ...

**Figure 3.** A comparison of the Mean AUC-ROC for DOMIAS and Gen-LRA using density estimation ...

Original abstract

Auditing the privacy leakage of synthetic data is an important but unresolved problem. Existing privacy auditing frameworks for synthetic data rely on heuristics and unrealistic assumptions about model access, offering limited ability to describe or detect the privacy exposure of training data through synthetic data release. In this paper, we study designing membership inference attacks (MIAs) that specifically exploit the observation that tabular generative models tend to significantly overfit to certain regions of the training distribution. We propose Generative Likelihood Ratio Attack (Gen-LRA), a novel, computationally efficient No-Box MIA that, with no assumption of model knowledge or access, formulates its attack by evaluating the influence a test observation has on a surrogate model's estimate of a local likelihood ratio over the synthetic data. We develop a theoretical framework for the attack: we show that the Gen-LRA score admits a closed-form characterization as a localized density-ratio statistic, and we prove that under a general model of local overfitting it produces a provable mean-score gap between members and non-members, yielding testable predictions for when the attack should succeed. We validate these predictions in a controlled simulation study and assess Gen-LRA against a comprehensive benchmark spanning diverse datasets, generative model architectures, and attack parameters. Across metrics, Gen-LRA consistently dominates competing MIAs, with especially strong gains at low false positive rates. These results underscore Gen-LRA's effectiveness as a privacy auditing tool for the release of synthetic data, and highlight the significant privacy risks posed by generative model overfitting in real-world applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes the Generative Likelihood Ratio Attack (Gen-LRA), a no-box membership inference attack for auditing privacy leakage in synthetic tabular data. It claims a closed-form characterization of the Gen-LRA score as a localized density-ratio statistic and proves a mean-score gap between members and non-members under a parameterized local overfitting model of generative processes. The work validates the predictions via controlled simulations and reports superior performance over competing MIAs across diverse datasets, architectures, and metrics, with particular gains at low false-positive rates.

Significance. If the local-overfitting model holds for real tabular generators, the closed-form derivation and provable gap provide a principled, efficient auditing tool without model access, along with testable predictions for attack success. The simulation validation inside the assumed regime and the empirical dominance in benchmarks are notable strengths that could inform privacy standards for synthetic data release.

major comments (2)

[Theoretical framework] Theoretical framework (around the mean-score gap derivation): The proof relies on a general model of local overfitting that assigns higher density mass to training points in localized regions. The manuscript does not independently quantify the degree or locality of overfitting present in the benchmarked models (e.g., diffusion or flow-based generators with regularization), so it remains unclear whether the provable gap applies outside the controlled simulation regime.
[Benchmark experiments] Benchmark experiments section: The reported dominance of Gen-LRA is load-bearing for the practical claim, yet the evaluation does not include an ablation or measurement isolating the contribution of the local-overfitting assumption versus the underlying density-ratio estimator; without this, the results do not confirm that the theoretical gap drives the observed improvements.

minor comments (2)

Ensure all equations in the closed-form characterization and proof are numbered and explicitly referenced in the text for traceability.
Clarify the precise construction and hyperparameter choices for the surrogate model used to estimate the local likelihood ratio.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below and describe the revisions we will make to strengthen the connection between theory and experiments.

Referee: [Theoretical framework] Theoretical framework (around the mean-score gap derivation): The proof relies on a general model of local overfitting that assigns higher density mass to training points in localized regions. The manuscript does not independently quantify the degree or locality of overfitting present in the benchmarked models (e.g., diffusion or flow-based generators with regularization), so it remains unclear whether the provable gap applies outside the controlled simulation regime.

Authors: We agree that an independent quantification of local overfitting in the benchmarked models would help clarify the applicability of the mean-score gap beyond the simulation regime. In the revised manuscript we will add a dedicated analysis subsection that reports simple, model-agnostic indicators of local overfitting (e.g., the ratio of local density mass assigned to training versus held-out points within small neighborhoods of the synthetic data) for each generative model and dataset used in the benchmarks. These measurements will be presented alongside the existing attack results to allow readers to assess how well the modeling assumptions align with the observed behavior of real tabular generators.

revision: yes

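An indicator of the kind the simulated response describes could look like the following. This is one possible reading, offered as a sketch. It counts synthetic points within a fixed radius of each training and held-out record and compares the averages. The function names, the radius, and the toy data are hypothetical and do not come from the paper or the rebuttal.

```python
import math

def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def local_mass(points, synthetic, radius):
    """Average number of synthetic points within `radius` of each point."""
    r2 = radius * radius
    total = sum(
        sum(1 for s in synthetic if sq_dist(p, s) <= r2)
        for p in points
    )
    return total / len(points)

def local_overfitting_ratio(train, holdout, synthetic, radius):
    """Ratio of synthetic mass near training points to mass near held-out points.

    Values well above 1 suggest the generator places extra mass around
    training records; values near 1 suggest no local preference.
    """
    held_out_mass = local_mass(holdout, synthetic, radius)
    if held_out_mass == 0:
        return math.inf  # no synthetic mass near any held-out point
    return local_mass(train, synthetic, radius) / held_out_mass

# Toy example: three synthetic points hug training records, one hugs a held-out record.
train = [(0.0, 0.0), (1.0, 1.0)]
holdout = [(0.0, 1.0), (1.0, 0.0)]
synthetic = [(0.01, 0.0), (0.0, 0.01), (1.0, 1.01), (0.5, 0.5), (0.02, 1.0)]

ratio = local_overfitting_ratio(train, holdout, synthetic, radius=0.05)
print(ratio)
```

The radius plays the same role as a locality parameter in the attack, so reporting the ratio across several radii would show how localized the overfitting is.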
Referee: [Benchmark experiments] Benchmark experiments section: The reported dominance of Gen-LRA is load-bearing for the practical claim, yet the evaluation does not include an ablation or measurement isolating the contribution of the local-overfitting assumption versus the underlying density-ratio estimator; without this, the results do not confirm that the theoretical gap drives the observed improvements.

Authors: We acknowledge that an explicit ablation isolating the local-overfitting component would strengthen the empirical claims. Because the closed-form Gen-LRA statistic is derived directly from the local likelihood ratio, a perfect separation is not straightforward; however, we will add two targeted comparisons in the revised experiments: (1) a non-local baseline that replaces the localized density-ratio estimator with its global counterpart, and (2) a sweep over the locality radius parameter that controls how narrowly the attack focuses on local regions. These additions will provide evidence on the incremental benefit attributable to the local-overfitting modeling assumption while remaining within the scope of the existing evaluation framework.

revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained under explicit assumptions

The paper derives a closed-form characterization of the Gen-LRA score as a localized density-ratio statistic and proves a mean-score gap under a general parameterized model of local overfitting. This constitutes a theoretical derivation from stated premises rather than any reduction of the claimed result to its own fitted inputs, self-citations, or definitions by construction. No load-bearing steps match the enumerated circularity patterns; the simulation validates predictions inside the assumed regime without the central claim collapsing to a tautology via the paper's equations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper's central claim depends on one domain assumption about local overfitting behavior of tabular generators; no free parameters or new invented entities are introduced in the abstract.

axioms (1)

domain assumption: Tabular generative models significantly overfit to certain regions of the training distribution.
Invoked in abstract paragraph 2 as the observation that enables the attack; the provable gap is shown only under a general model of this overfitting.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We show that the Gen-LRA score admits a closed-form characterization as a localized density-ratio statistic, and we prove that under a general model of local overfitting it produces a provable mean-score gap

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: tabular generative models tend to significantly overfit to certain regions of the training distribution

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

When Tables Leak: Attacking String Memorization in LLM-Based Tabular Data Generation
cs.LG 2025-12 conditional novelty 7.0

LLM tabular generators leak memorized numeric strings, allowing a no-box attack to achieve near-perfect membership inference on some state-of-the-art models.
