Private Seeds, Public LLMs: Realistic and Privacy-Preserving Synthetic Data Generation

Qian Ma; Sarah Rajtmajer

arxiv: 2604.07486 · v2 · submitted 2026-04-08 · 💻 cs.CR · cs.AI

Pith reviewed 2026-05-10 17:24 UTC · model grok-4.3

keywords synthetic data generation · differential privacy · large language models · privacy preservation · private seeds · text data · data utility · privacy-utility tradeoff

The pith

Private seeds and differential privacy let public LLMs generate synthetic text that matches private data closely while protecting privacy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method called RPSG for creating synthetic versions of private text data. It works by using private seeds to guide public large language models and applies a formal differential privacy mechanism when selecting output candidates. Experiments compare it to other private synthetic data methods and show it achieves high fidelity to the original private data along with strong privacy guarantees. A sympathetic reader would care because this could allow sharing or using data for machine learning and analysis in sensitive areas like healthcare or finance without risking exposure of real personal information.

Core claim

RPSG uses private seeds and integrates privacy-preserving strategies, including a formal differential privacy mechanism in the candidate selection, to generate realistic synthetic data from public LLMs that achieves high fidelity to private data while providing strong privacy protection.

What carries the argument

The RPSG approach, which seeds public LLMs with private data and uses DP in candidate selection to ensure both realism and privacy.
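
As a rough illustration of what a DP candidate-selection step can look like, the sketch below uses the exponential mechanism, a standard ε-DP selection primitive. The abstract does not say which mechanism RPSG uses, so the choice of mechanism, the function names, the utility scores, and the sensitivity value are all assumptions made for illustration, not the authors' implementation.

```python
import math
import random
from typing import List, Optional, Sequence


def selection_probabilities(
    utilities: Sequence[float], epsilon: float, sensitivity: float = 1.0
) -> List[float]:
    """Exponential-mechanism probabilities: p_i proportional to
    exp(epsilon * u_i / (2 * sensitivity))."""
    if not utilities:
        raise ValueError("need at least one candidate")
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    scaled = [epsilon * u / (2.0 * sensitivity) for u in utilities]
    top = max(scaled)  # subtract the max before exponentiating, for stability
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def select_candidate(
    candidates: Sequence[str],
    utilities: Sequence[float],
    epsilon: float,
    sensitivity: float = 1.0,
    rng: Optional[random.Random] = None,
) -> str:
    """Sample one candidate instead of deterministically taking the best one."""
    if len(candidates) != len(utilities):
        raise ValueError("candidates and utilities must have the same length")
    rng = rng or random.Random()
    probs = selection_probabilities(utilities, epsilon, sensitivity)
    return rng.choices(list(candidates), weights=probs, k=1)[0]


# Hypothetical usage: three LLM outputs scored against a private seed.
# The scores are made up; in practice they would come from a similarity
# measure whose sensitivity to one private record is bounded (assumed 1.0).
outputs = ["candidate A", "candidate B", "candidate C"]
scores = [0.2, 0.9, 0.5]
chosen = select_candidate(outputs, scores, epsilon=1.0, rng=random.Random(0))
```

A smaller `epsilon` flattens the selection distribution toward uniform (more privacy, less fidelity), while a larger one concentrates it on the highest-scoring candidate.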

If this is right

Generated synthetic data can replace private data in training models or performing analyses with minimal utility loss.
Public LLMs become usable for private data tasks without needing to fine-tune them on sensitive information.
Strong privacy is maintained through formal DP guarantees even when using large public models.
Synthetic replicas maintain high similarity to private originals as measured in experiments against baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This method could lower barriers to data sharing in regulated industries by providing usable substitutes.
It suggests that hybrid private-public model setups might address privacy concerns in generative AI more broadly.
Future work might test if this scales to other modalities like images or structured data.

Load-bearing premise

That combining private seeds with public LLMs and differential privacy during selection will consistently produce realistic synthetic data without unexpected privacy leaks or significant drops in usefulness.

What would settle it

A successful membership inference attack or re-identification of original private data points from the synthetic outputs at rates exceeding the DP bounds, or downstream task performance on synthetic data falling short of private data baselines in controlled tests.
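
To make "exceeding the DP bounds" concrete: under (ε, δ)-differential privacy, any membership test with false-positive rate FPR has a true-positive rate of at most min(e^ε · FPR + δ, 1 − e^(−ε) · (1 − δ − FPR)). The sketch below encodes that standard hypothesis-testing bound. The function names and the example rates are hypothetical, and the check is a point comparison only; a real audit would also need confidence intervals on the measured rates.

```python
import math


def max_tpr_under_dp(fpr: float, epsilon: float, delta: float = 0.0) -> float:
    """Upper bound on an attacker's true-positive rate at a given
    false-positive rate under (epsilon, delta)-DP."""
    if not 0.0 <= fpr <= 1.0:
        raise ValueError("fpr must lie in [0, 1]")
    if epsilon < 0 or not 0.0 <= delta <= 1.0:
        raise ValueError("need epsilon >= 0 and delta in [0, 1]")
    bound_a = math.exp(epsilon) * fpr + delta
    bound_b = 1.0 - math.exp(-epsilon) * (1.0 - delta - fpr)
    return max(0.0, min(1.0, bound_a, bound_b))


def exceeds_dp_bound(
    tpr: float, fpr: float, epsilon: float, delta: float = 0.0
) -> bool:
    """True if the measured attack rates are inconsistent with the claimed
    guarantee (ignoring sampling error in tpr and fpr)."""
    return tpr > max_tpr_under_dp(fpr, epsilon, delta)


# Hypothetical measurements at a claimed epsilon of 1.0:
suspicious = exceeds_dp_bound(tpr=0.50, fpr=0.10, epsilon=1.0)   # above the bound
consistent = exceeds_dp_bound(tpr=0.20, fpr=0.10, epsilon=1.0)   # within the bound
```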

Figures

Figures reproduced from arXiv: 2604.07486 by Qian Ma, Sarah Rajtmajer.

**Figure 2.** Efficiency comparison on Reddit for generating 1,000 synthetic samples with no DP (ϵ = ∞). [Plot not reproduced; surviving axis label: Successful Extraction Rate (%); models shown: DeepSeek-R1, GPT-2, GPT-3.5, GPT-4o-mini, Phi-4.]

**Figure 5.** Effect of synthetic sample size and temperature on next-word prediction. [Plot not reproduced; axis labels: Synthetic Sample Size, FID; legend: temperature = 0.2, 0.5, 0.8, 1.0, 1.2.]

Original abstract

Large language models (LLMs) have emerged as a powerful tool for synthetic data generation. A particularly important use case is producing synthetic replicas of private text, which requires carefully balancing privacy and utility. We propose Realistic and Privacy-Preserving Synthetic Data Generation (RPSG), which uses private seeds and integrates privacy-preserving strategies, including a formal differential privacy (DP) mechanism in the candidate selection, to generate realistic synthetic data. Comprehensive experiments against state-of-the-art private synthetic data generation methods demonstrate that RPSG achieves high fidelity to private data while providing strong privacy protection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper proposes RPSG as a new approach to private synthetic data generation using private seeds and public LLMs with DP in selection, but its claims rest on experiments not detailed in the abstract.

The punchline is that this paper suggests a way to create synthetic private text data by seeding public LLMs with private information and using differential privacy to pick the best outputs. What stands out as new is the integration of private seeds with public LLMs, combined with DP specifically in the candidate selection process. This is presented as distinct from previous private synthetic data approaches.

The paper does a solid job of framing a practical solution for a real issue. Generating synthetic replicas of private text is valuable for training AI models in areas like healthcare and finance where data sharing is restricted. The abstract indicates that, in experiments against state-of-the-art alternatives, the method achieves high fidelity to private data with strong privacy protection.

On the soft spots, the biggest one is the lack of visible details on the experiments. The abstract talks about comprehensive experiments demonstrating high fidelity and strong privacy, but without access to the methods section, the specific DP mechanism, the evaluation metrics, or the results tables, it is difficult to assess whether the claims hold up. The soundness is limited by this. If the full paper shows careful implementation and reproducible results, that would strengthen it considerably.

The citation pattern seems standard, building on differential privacy and LLM capabilities, with no signs of circular reasoning.

This paper is for researchers focused on privacy-preserving techniques for data generation and synthetic data in machine learning. Readers interested in applying LLMs to privacy-sensitive tasks would find the concept useful to consider. I recommend it for peer review. The idea is worth a closer examination by experts who can verify the experimental claims and the privacy guarantees.

Referee Report

0 major / 2 minor

Summary. The manuscript proposes Realistic and Privacy-Preserving Synthetic Data Generation (RPSG), a method for creating synthetic replicas of private text data. It uses private seeds to guide public large language models (LLMs) and incorporates a formal differential privacy (DP) mechanism during candidate selection to balance fidelity and privacy. The central claim is that comprehensive experiments against state-of-the-art private synthetic data methods demonstrate that RPSG achieves high fidelity to the private data while providing strong privacy protection.

Significance. If the experimental results hold, the work could be significant for privacy-preserving data synthesis in sensitive domains. The approach of anchoring generation with private seeds while leveraging public LLMs and adding DP in candidate selection offers a practical alternative to training fully private models, potentially improving scalability and accessibility. The explicit use of a formal DP mechanism is a strength that distinguishes it from heuristic privacy approaches.

minor comments (2)

The abstract asserts 'comprehensive experiments', 'high fidelity', and 'strong privacy protection' without naming the baseline methods, reporting any quantitative metrics (e.g., fidelity scores, privacy budgets), or specifying DP parameters such as epsilon. Adding these details would make the abstract's claims concrete and checkable.
The description of the candidate selection step would benefit from brief high-level pseudocode or a diagram in the methods section to clarify how the formal DP mechanism is applied. Lower-level implementation details could be moved to an appendix.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their review and for summarizing our work on RPSG. We appreciate the recognition of the approach's potential significance as a practical alternative for privacy-preserving synthetic data generation that leverages public LLMs with private seeds and formal DP. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity

The paper proposes RPSG as a practical method that combines private seeds, public LLMs, and a formal DP mechanism during candidate selection, then validates the approach via experiments against existing baselines. No derivation chain, equations, or first-principles claims appear in the abstract or method description that reduce to self-definition, fitted inputs renamed as predictions, or load-bearing self-citations. The central claims rest on empirical fidelity and privacy measurements rather than tautological constructions, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based solely on abstract; no explicit free parameters, axioms, or invented entities are stated or derivable from the provided text.

pith-pipeline@v0.9.0 · 5385 in / 1128 out tokens · 38736 ms · 2026-05-10T17:24:21.107244+00:00 · methodology

