Privacy Auditing Synthetic Data Release through Local Likelihood Attacks
Pith reviewed 2026-05-18 20:14 UTC · model grok-4.3
The pith
A no-box membership inference attack exploits local overfitting in tabular generative models to audit privacy leakage in synthetic data releases.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Gen-LRA score admits a closed-form characterization as a localized density-ratio statistic. Under a general model of local overfitting, the attack provably separates the mean scores of members and non-members of the training data, which yields testable predictions for when it should succeed across datasets, model architectures, and attack parameters.
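To make the phrase "localized density-ratio statistic" concrete, the block below works out what such a score looks like in one special case. It is a sketch under stated assumptions, not the paper's theorem: the surrogate is assumed to be a fixed-bandwidth kernel density estimator, and the symbols R, K_h, N_k and \Lambda are notation introduced here for illustration.

```latex
% Assumptions (not from the paper): the surrogate is a fixed-bandwidth KDE with
% kernel K_h fit on a reference set R = {r_1, ..., r_n}; N_k(x^*) denotes the
% k synthetic points nearest to the test point x^*.
\hat p_R(s) = \frac{1}{n} \sum_{j=1}^{n} K_h(s - r_j),
\qquad
\hat p_{R \cup \{x^*\}}(s) = \frac{n\,\hat p_R(s) + K_h(s - x^*)}{n + 1}

% Influence of x^* on the local likelihood ratio over the synthetic data:
\Lambda(x^*)
  = \sum_{s \in N_k(x^*)} \log \frac{\hat p_{R \cup \{x^*\}}(s)}{\hat p_R(s)}
  = \sum_{s \in N_k(x^*)} \left[ \log\!\left( 1 + \frac{K_h(s - x^*)}{n\,\hat p_R(s)} \right) - \log \frac{n + 1}{n} \right]
```

In this special case the score is large when synthetic points sit close to x^* (large K_h(s - x^*)) in a region where the reference density \hat p_R(s) is comparatively low, which is the pattern local overfitting would be expected to produce around training records.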
What carries the argument
The Gen-LRA score, which quantifies the influence of a test point on a surrogate model's local likelihood-ratio estimate over the synthetic data.
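A minimal sketch of a score of this kind may help fix ideas. It assumes a fixed-bandwidth Gaussian KDE as the surrogate, a reference set `ref` drawn from the same population, and a neighborhood made of the `k` synthetic points nearest to the test point. The function names, the bandwidth `h`, and the toy data are hypothetical choices for illustration and should not be read as the authors' implementation.

```python
import numpy as np

def log_kde(train, query, h):
    """Log-density at `query` rows under a fixed-bandwidth Gaussian KDE fit on `train` rows."""
    n, d = train.shape
    sq_dist = ((query[:, None, :] - train[None, :, :]) ** 2).sum(axis=2)
    log_k = -0.5 * sq_dist / h**2 - 0.5 * d * np.log(2 * np.pi * h**2)
    top = log_k.max(axis=1, keepdims=True)  # log-sum-exp for numerical stability
    return (top + np.log(np.exp(log_k - top).sum(axis=1, keepdims=True))).ravel() - np.log(n)

def gen_lra_style_score(x_star, synth, ref, k=10, h=0.3):
    """Change in local log-likelihood of synthetic data when x_star is added to the surrogate."""
    # Assumption: "local" means the k synthetic points nearest to x_star.
    nearest = synth[np.argsort(np.linalg.norm(synth - x_star, axis=1))[:k]]
    ref_plus = np.vstack([ref, x_star])  # surrogate training set with the test point added
    return float((log_kde(ref_plus, nearest, h) - log_kde(ref, nearest, h)).sum())

# Toy data (hypothetical): an extreme generator that memorized a single record.
rng = np.random.default_rng(0)
ref = rng.normal(size=(200, 2))
member = np.array([0.3, -0.2])
non_member = np.array([1.3, -0.2])
synth = member + 0.05 * rng.normal(size=(50, 2))

member_score = gen_lra_style_score(member, synth, ref)
non_member_score = gen_lra_style_score(non_member, synth, ref)
```

In this toy setting the memorized record receives a positive score and the distant record a negative one. Real generators overfit far less cleanly, so the separation there is an empirical question.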
If this is right
- Gen-LRA consistently outperforms competing membership inference attacks across diverse tabular datasets, generative architectures, and attack parameters, especially at low false-positive rates.
- The attack requires no model knowledge or access, so it applies to releases where only the synthetic dataset itself is published.
- The theoretical framework supplies predictions for attack success that can be validated by measuring the degree of local overfitting in a given generator.
- Real-world synthetic data releases carry measurable privacy risks when generative models exhibit regional overfitting.
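Performance "at low false-positive rates" is usually reported as the true-positive rate at a fixed false-positive budget. The helper below is one common way to compute it from attack scores. The function name and the tie-handling rule (flag only scores strictly above the threshold) are choices made here, not details taken from the paper.

```python
import numpy as np

def tpr_at_fpr(scores, is_member, target_fpr=0.01):
    """True-positive rate when at most `target_fpr` of non-members are flagged."""
    scores = np.asarray(scores, dtype=float)
    is_member = np.asarray(is_member, dtype=bool)
    non_member_scores = np.sort(scores[~is_member])[::-1]  # descending
    # Number of non-members we may flag without exceeding the budget.
    allowed = int(np.floor(target_fpr * non_member_scores.size))
    if allowed >= non_member_scores.size:
        threshold = -np.inf
    else:
        # Flag scores strictly above the (allowed + 1)-th largest non-member score.
        threshold = non_member_scores[allowed]
    return float((scores[is_member] > threshold).mean())

# Toy scores (hypothetical): four members followed by four non-members.
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.5, 0.2, 0.1, 0.0]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
strict = tpr_at_fpr(toy_scores, toy_labels, target_fpr=0.0)
loose = tpr_at_fpr(toy_scores, toy_labels, target_fpr=0.25)
```

With a zero false-positive budget only the two members scoring above every non-member are caught, so `strict` is 0.5. Allowing one false positive out of four lowers the threshold to 0.2 and `loose` becomes 1.0.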
Where Pith is reading between the lines
- If local overfitting is the main source of leakage, then adding region-specific regularization to generators could shrink the mean-score gap and lower attack success.
- The localized density-ratio construction might extend to auditing privacy leakage in non-tabular data modalities where generative models still overfit locally.
- Testing Gen-LRA on generators that explicitly penalize regional memorization would provide a direct check of whether the predicted score gap disappears.
Load-bearing premise
Tabular generative models tend to significantly overfit to certain regions of the training distribution.
What would settle it
A controlled experiment in which a generative model is regularized to remove local overfitting regions, after which the mean Gen-LRA score gap between members and non-members vanishes and the attack performs no better than random guessing.
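The shape of such an experiment can be sketched with a toy generator whose overfitting is controlled by a single knob. Everything below is an illustrative assumption rather than the paper's simulation design: the generator copies a training record plus small noise with probability `alpha`, the surrogate is a fixed-bandwidth Gaussian KDE, and the score uses the fact that adding a point x* to such a KDE changes the estimate at s from p_R(s) to (n p_R(s) + K_h(s - x*)) / (n + 1).

```python
import numpy as np

def kernel(diff, h):
    """Isotropic Gaussian kernel K_h on difference vectors (last axis = features)."""
    d = diff.shape[-1]
    return np.exp(-0.5 * (diff ** 2).sum(axis=-1) / h**2) / (2 * np.pi * h**2) ** (d / 2)

def local_lr_score(x_star, synth, ref, k, h):
    """Sum over the k nearest synthetic points of log[(n p_R + K_h) / ((n + 1) p_R)]."""
    nearest = synth[np.argsort(np.linalg.norm(synth - x_star, axis=1))[:k]]
    n = len(ref)
    p_ref = kernel(nearest[:, None, :] - ref[None, :, :], h).mean(axis=1)
    bump = kernel(nearest - x_star, h)
    return float(np.log((n * p_ref + bump) / ((n + 1) * p_ref)).sum())

def mean_score_gap(alpha, seed=0, d=5, n_train=300, n_synth=1000, n_ref=500, k=5, h=0.5):
    """Mean member score minus mean non-member score for a toy generator."""
    rng = np.random.default_rng(seed)
    train = rng.normal(size=(n_train, d))
    non_members = rng.normal(size=(n_train, d))
    ref = rng.normal(size=(n_ref, d))
    # Toy generator (assumption): with probability alpha, emit a noisy copy of a
    # training record; otherwise draw a fresh sample from the true distribution.
    synth = rng.normal(size=(n_synth, d))
    copies = rng.random(n_synth) < alpha
    idx = rng.integers(0, n_train, size=n_synth)
    synth[copies] = train[idx[copies]] + 0.05 * rng.normal(size=(copies.sum(), d))
    member_mean = np.mean([local_lr_score(x, synth, ref, k, h) for x in train])
    non_member_mean = np.mean([local_lr_score(x, synth, ref, k, h) for x in non_members])
    return float(member_mean - non_member_mean)

gap_clean = mean_score_gap(alpha=0.0)    # generator independent of the training set
gap_overfit = mean_score_gap(alpha=0.5)  # half the synthetic rows are near-copies
```

When `alpha` is zero the synthetic data are independent of the training set, so members and non-members are exchangeable and the gap should hover around zero. Raising `alpha` should open a positive gap, and a generator regularized back toward `alpha = 0` should close it again, which is the pattern the settling experiment looks for.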
Original abstract
Auditing the privacy leakage of synthetic data is an important but unresolved problem. Existing privacy auditing frameworks for synthetic data rely on heuristics and unrealistic assumptions about model access, offering limited ability to describe or detect the privacy exposure of training data through synthetic data release. In this paper, we study designing membership inference attacks (MIAs) that specifically exploit the observation that tabular generative models tend to significantly overfit to certain regions of the training distribution. We propose Generative Likelihood Ratio Attack (Gen-LRA), a novel, computationally efficient No-Box MIA that, with no assumption of model knowledge or access, formulates its attack by evaluating the influence a test observation has on a surrogate model's estimate of a local likelihood ratio over the synthetic data. We develop a theoretical framework for the attack: we show that the Gen-LRA score admits a closed-form characterization as a localized density-ratio statistic, and we prove that under a general model of local overfitting it produces a provable mean-score gap between members and non-members, yielding testable predictions for when the attack should succeed. We validate these predictions in a controlled simulation study and assess Gen-LRA against a comprehensive benchmark spanning diverse datasets, generative model architectures, and attack parameters. Across metrics, Gen-LRA consistently dominates competing MIAs, with especially strong gains at low false positive rates. These results underscore Gen-LRA's effectiveness as a privacy auditing tool for the release of synthetic data, and highlight the significant privacy risks posed by generative model overfitting in real-world applications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes the Generative Likelihood Ratio Attack (Gen-LRA), a no-box membership inference attack for auditing privacy leakage in synthetic tabular data. It claims a closed-form characterization of the Gen-LRA score as a localized density-ratio statistic and proves a mean-score gap between members and non-members under a parameterized local overfitting model of generative processes. The work validates the predictions via controlled simulations and reports superior performance over competing MIAs across diverse datasets, architectures, and metrics, with particular gains at low false-positive rates.
Significance. If the local-overfitting model holds for real tabular generators, the closed-form derivation and provable gap provide a principled, efficient auditing tool without model access, along with testable predictions for attack success. The simulation validation inside the assumed regime and the empirical dominance in benchmarks are notable strengths that could inform privacy standards for synthetic data release.
major comments (2)
- [Theoretical framework] Theoretical framework (around the mean-score gap derivation): The proof relies on a general model of local overfitting that assigns higher density mass to training points in localized regions. The manuscript does not independently quantify the degree or locality of overfitting present in the benchmarked models (e.g., diffusion or flow-based generators with regularization), so it remains unclear whether the provable gap applies outside the controlled simulation regime.
- [Benchmark experiments] Benchmark experiments section: The reported dominance of Gen-LRA is load-bearing for the practical claim, yet the evaluation does not include an ablation or measurement isolating the contribution of the local-overfitting assumption versus the underlying density-ratio estimator; without this, the results do not confirm that the theoretical gap drives the observed improvements.
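The first major comment asks for an independent measurement of local overfitting. One simple, model-agnostic indicator is to compare how much training mass versus held-out mass falls inside small neighborhoods of the synthetic points. The sketch below is one possible reading of that idea. The function name, the fixed `radius`, and the toy data are hypothetical.

```python
import numpy as np

def local_mass_ratio(synth, train, holdout, radius):
    """Share of train points within `radius` of a synthetic point, divided by the holdout share."""
    def mean_fraction(points):
        dist = np.linalg.norm(synth[:, None, :] - points[None, :, :], axis=2)
        return (dist <= radius).mean()  # averaged over all (synthetic, point) pairs
    frac_train, frac_holdout = mean_fraction(train), mean_fraction(holdout)
    if frac_holdout == 0:
        # Neighborhoods too small to contain any held-out point.
        return float("nan") if frac_train == 0 else float("inf")
    return float(frac_train / frac_holdout)

# Toy data (hypothetical): train and holdout come from the same distribution.
rng = np.random.default_rng(0)
train = rng.normal(size=(300, 2))
holdout = rng.normal(size=(300, 2))
synth_clean = rng.normal(size=(300, 2))                          # independent of train
synth_overfit = train + 0.01 * rng.normal(size=train.shape)      # near-copies of train

ratio_clean = local_mass_ratio(synth_clean, train, holdout, radius=0.1)
ratio_overfit = local_mass_ratio(synth_overfit, train, holdout, radius=0.1)
```

A ratio near 1 suggests the synthetic data sit no closer to training records than to held-out ones, while values well above 1 point to local overfitting. The choice of `radius` matters and would need to be tuned per dataset, especially for mixed-type tabular data where a Euclidean distance is itself an assumption.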
minor comments (2)
- Ensure all equations in the closed-form characterization and proof are numbered and explicitly referenced in the text for traceability.
- Clarify the precise construction and hyperparameter choices for the surrogate model used to estimate the local likelihood ratio.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address each major comment below and describe the revisions we will make to strengthen the connection between theory and experiments.
- Referee: [Theoretical framework] Theoretical framework (around the mean-score gap derivation): The proof relies on a general model of local overfitting that assigns higher density mass to training points in localized regions. The manuscript does not independently quantify the degree or locality of overfitting present in the benchmarked models (e.g., diffusion or flow-based generators with regularization), so it remains unclear whether the provable gap applies outside the controlled simulation regime.
Authors: We agree that an independent quantification of local overfitting in the benchmarked models would help clarify the applicability of the mean-score gap beyond the simulation regime. In the revised manuscript we will add a dedicated analysis subsection that reports simple, model-agnostic indicators of local overfitting (e.g., the ratio of local density mass assigned to training versus held-out points within small neighborhoods of the synthetic data) for each generative model and dataset used in the benchmarks. These measurements will be presented alongside the existing attack results to allow readers to assess how well the modeling assumptions align with the observed behavior of real tabular generators.
Revision: yes
- Referee: [Benchmark experiments] Benchmark experiments section: The reported dominance of Gen-LRA is load-bearing for the practical claim, yet the evaluation does not include an ablation or measurement isolating the contribution of the local-overfitting assumption versus the underlying density-ratio estimator; without this, the results do not confirm that the theoretical gap drives the observed improvements.
Authors: We acknowledge that an explicit ablation isolating the local-overfitting component would strengthen the empirical claims. Because the closed-form Gen-LRA statistic is derived directly from the local likelihood ratio, a perfect separation is not straightforward; however, we will add two targeted comparisons in the revised experiments: (1) a non-local baseline that replaces the localized density-ratio estimator with its global counterpart, and (2) a sweep over the locality radius parameter that controls how narrowly the attack focuses on local regions. These additions will provide evidence on the incremental benefit attributable to the local-overfitting modeling assumption while remaining within the scope of the existing evaluation framework.
Revision: partial
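The two comparisons proposed in the second response can share one harness: evaluate the same score at several locality settings and treat the largest setting as the non-local baseline. The sketch below shows only that harness. The scoring function used in the demo is a stand-in nearest-neighbor distance heuristic, not Gen-LRA, and locality is expressed as a neighbor count `k` rather than a radius. All names and data are hypothetical.

```python
import numpy as np

def auc(member_scores, non_member_scores):
    """Probability that a random member outscores a random non-member (ties count half)."""
    m = np.asarray(member_scores, dtype=float)[:, None]
    n = np.asarray(non_member_scores, dtype=float)[None, :]
    return float((m > n).mean() + 0.5 * (m == n).mean())

def sweep_locality(score_fn, members, non_members, ks):
    """Attack AUC for each locality setting k; score_fn(x, k) returns a membership score."""
    return {
        k: auc([score_fn(x, k) for x in members], [score_fn(x, k) for x in non_members])
        for k in ks
    }

# Toy data (hypothetical): a generator that emits near-copies of its training set.
rng = np.random.default_rng(0)
members = rng.normal(size=(100, 2))
non_members = rng.normal(size=(100, 2))
synth = members + 0.01 * rng.normal(size=members.shape)

def neg_mean_knn_distance(x, k):
    # Stand-in score: closer synthetic neighbors give a higher membership score.
    dist = np.sort(np.linalg.norm(synth - x, axis=1))
    return -dist[:k].mean()

# k = len(synth) averages over every synthetic point, i.e. the non-local baseline.
results = sweep_locality(neg_mean_knn_distance, members, non_members, ks=[1, 10, len(synth)])
```

In this toy case the most local setting separates members almost perfectly, while the global baseline only reflects how central a point is and should sit near chance. A similar curve for the actual attack would show how much of its advantage depends on locality.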
Circularity Check
No significant circularity; derivation self-contained under explicit assumptions
The paper derives a closed-form characterization of the Gen-LRA score as a localized density-ratio statistic and proves a mean-score gap under a general parameterized model of local overfitting. This constitutes a theoretical derivation from stated premises rather than any reduction of the claimed result to its own fitted inputs, self-citations, or definitions by construction. No load-bearing steps match the enumerated circularity patterns; the simulation validates predictions inside the assumed regime without the central claim collapsing to a tautology via the paper's equations.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Tabular generative models significantly overfit to certain regions of the training distribution.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We show that the Gen-LRA score admits a closed-form characterization as a localized density-ratio statistic, and we prove that under a general model of local overfitting it produces a provable mean-score gap
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: tabular generative models tend to significantly overfit to certain regions of the training distribution
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- When Tables Leak: Attacking String Memorization in LLM-Based Tabular Data Generation
LLM tabular generators leak memorized numeric strings, allowing a no-box attack to achieve near-perfect membership inference on some state-of-the-art models.
Reference graph
Works this paper leans on
- [6] Abalone (OpenML): https://www.openml.org/search?type=data&sort=runs&id=183&status=active
- [7] Adult: Becker and Kohavi [1996]
- [8] Bean (UCI): https://archive.ics.uci.edu/dataset/602/dry+bean+dataset
- [9] Churn-Modeling (Kaggle): https://www.kaggle.com/datasets/shrutimechlearn/churn-modelling
- [10] Faults (UCI): https://archive.ics.uci.edu/dataset/198/steel+plates+faults
- [11] HTRU (UCI): https://archive.ics.uci.edu/dataset/372/htru2
- [12] Indian Liver Patient (Kaggle): https://www.kaggle.com/datasets/uciml/indian-liver-patient-records?resource=download
- [13] Insurance (Kaggle): https://www.kaggle.com/datasets/mirichoi0218/insurance
- [14] Magic (Kaggle): https://www.kaggle.com/datasets/abhinand05/magic-gamma-telescope-dataset?resource=download
- [15] News (UCI): https://archive.ics.uci.edu/dataset/332/online+news+popularity
- [16] Nursery (Kaggle): https://www.kaggle.com/datasets/heitornunes/nursery
- [17] Obesity (Kaggle): https://www.kaggle.com/datasets/tathagatbanerjee/obesity-dataset-uci-ml
- [18] Shoppers (Kaggle): https://www.kaggle.com/datasets/henrysue/online-shoppers-intention
- [19] Titanic (Kaggle): https://www.kaggle.com/c/titanic/data
- [20] Wilt (OpenML): https://www.openml.org/search?type=data&sort=runs&id=40983&status=active
D Additional Results
D.1 Gen-LRA Encoding
As our main experiment uses Kernel Density Estimation (KDE) over (usually) heterogeneous datasets, we present an ablation for encoding tabular data to be numeric such that KDE can converge. We experiment with 3 common strategi...