
arxiv: 2604.07486 · v2 · submitted 2026-04-08 · 💻 cs.CR · cs.AI

Private Seeds, Public LLMs: Realistic and Privacy-Preserving Synthetic Data Generation

Pith reviewed 2026-05-10 17:24 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords synthetic data generation · differential privacy · large language models · privacy preservation · private seeds · text data · data utility · privacy-utility tradeoff

The pith

Private seeds and differential privacy let public LLMs generate synthetic text that matches private data closely while protecting privacy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method called RPSG for creating synthetic versions of private text data. It works by using private seeds to guide public large language models and applies a formal differential privacy mechanism when selecting output candidates. Experiments compare it to other private synthetic data methods and show it achieves high fidelity to the original private data along with strong privacy guarantees. A sympathetic reader would care because this could allow sharing or using data for machine learning and analysis in sensitive areas like healthcare or finance without risking exposure of real personal information.

Core claim

RPSG uses private seeds and integrates privacy-preserving strategies, including a formal differential privacy mechanism in the candidate selection, to generate realistic synthetic data from public LLMs that achieves high fidelity to private data while providing strong privacy protection.

What carries the argument

The RPSG approach, which seeds public LLMs with private data and uses DP in candidate selection to ensure both realism and privacy.
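
The summary above does not say which formal DP mechanism RPSG applies during candidate selection. As a minimal sketch, assuming the exponential mechanism (a standard DP selection primitive, swapped in here for illustration and not confirmed as the paper's choice), private selection among LLM-generated candidates could look like the following. The names `selection_probabilities` and `select_candidate`, the idea of scoring each candidate by similarity to a private seed, and the sensitivity value are all hypothetical.

```python
import math
import random


def selection_probabilities(scores, epsilon, sensitivity):
    """Exponential-mechanism probabilities: p_i proportional to exp(eps * s_i / (2 * sens)).

    `scores` are hypothetical utility scores (e.g. similarity of each candidate
    to a private seed); `sensitivity` bounds how much one private record can
    change any score. Both are assumptions made for this sketch.
    """
    if not scores:
        raise ValueError("need at least one candidate score")
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    top = max(scores)  # subtract the max so exp() cannot overflow
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity)) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def select_candidate(scores, epsilon, sensitivity, rng=None):
    """Sample one candidate index according to the probabilities above."""
    rng = rng or random.Random()
    probs = selection_probabilities(scores, epsilon, sensitivity)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]
```

Under this assumption, a smaller `epsilon` flattens the distribution toward uniform (more privacy, less fidelity), while a larger `epsilon` concentrates probability on the highest-scoring candidate.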

If this is right

  • Generated synthetic data can replace private data in training models or performing analyses with minimal utility loss.
  • Public LLMs become usable for private data tasks without needing to fine-tune them on sensitive information.
  • Strong privacy is maintained through formal DP guarantees even when using large public models.
  • Synthetic replicas maintain high similarity to private originals as measured in experiments against baselines.
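
One way to make "minimal utility loss" concrete is a train-on-synthetic, test-on-real comparison. The sketch below is illustrative only and is not the paper's evaluation protocol: it uses a deliberately tiny bag-of-words classifier as a stand-in for a real downstream model, and every function name is hypothetical.

```python
from collections import Counter


def train_centroids(texts, labels):
    """Accumulate bag-of-words counts per label (a toy stand-in classifier)."""
    centroids = {}
    for text, label in zip(texts, labels):
        centroids.setdefault(label, Counter()).update(text.lower().split())
    return centroids


def predict(centroids, text):
    """Return the label whose normalized word counts best cover the text."""
    tokens = text.lower().split()

    def score(label):
        counts = centroids[label]
        total = sum(counts.values()) or 1
        return sum(counts[t] for t in tokens) / total

    # sorted() makes tie-breaking deterministic
    return max(sorted(centroids), key=score)


def accuracy(centroids, texts, labels):
    hits = sum(predict(centroids, t) == y for t, y in zip(texts, labels))
    return hits / len(texts)


def utility_gap(private, synthetic, real_test):
    """Accuracy when trained on private data minus accuracy when trained on synthetic data.

    Each argument is a (texts, labels) pair; both models are scored on the
    same held-out real test set. A gap near zero means little utility loss.
    """
    acc_private = accuracy(train_centroids(*private), *real_test)
    acc_synthetic = accuracy(train_centroids(*synthetic), *real_test)
    return acc_private - acc_synthetic
```

In practice the toy classifier would be replaced by the actual downstream model and metric, and the gap would be reported at each privacy budget.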

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method could lower barriers to data sharing in regulated industries by providing usable substitutes.
  • It suggests that hybrid private-public model setups might address privacy concerns in generative AI more broadly.
  • Future work might test if this scales to other modalities like images or structured data.

Load-bearing premise

That combining private seeds with public LLMs and differential privacy during selection will consistently produce realistic synthetic data without unexpected privacy leaks or significant drops in usefulness.

What would settle it

A successful membership inference attack or re-identification of original private data points from the synthetic outputs at rates exceeding the DP bounds, or downstream task performance on synthetic data falling short of private data baselines in controlled tests.
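
"Exceeding the DP bounds" can be made precise with the standard hypothesis-testing view of differential privacy: under (ε, δ)-DP, any membership attack with false-positive rate FPR has true-positive rate at most min(e^ε · FPR + δ, 1 − e^(−ε) · (1 − δ − FPR)). The sketch below computes that ceiling. The function names are hypothetical, and the bound covers only what the DP mechanism itself protects; the summary above does not state RPSG's end-to-end guarantee or the ε values used.

```python
import math


def max_tpr_under_dp(fpr, epsilon, delta=0.0):
    """Largest true-positive rate any attack can reach under (epsilon, delta)-DP."""
    if not 0.0 <= fpr <= 1.0:
        raise ValueError("fpr must lie in [0, 1]")
    if epsilon < 0 or not 0.0 <= delta <= 1.0:
        raise ValueError("need epsilon >= 0 and delta in [0, 1]")
    bound_a = math.exp(epsilon) * fpr + delta
    bound_b = 1.0 - math.exp(-epsilon) * (1.0 - delta - fpr)
    # clamp to a valid probability
    return max(0.0, min(1.0, bound_a, bound_b))


def exceeds_dp_bound(observed_tpr, fpr, epsilon, delta=0.0):
    """True if an observed attack operating point is above the DP ceiling."""
    return observed_tpr > max_tpr_under_dp(fpr, epsilon, delta)
```

An observed rate is an estimate, so a real audit would compare a confidence interval on the attack's TPR, not a single point, against this ceiling.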

Figures

Figures reproduced from arXiv: 2604.07486 by Qian Ma, Sarah Rajtmajer.

Figure 1
Figure 1: Illustration of the RPSG Method Pipeline. Comparative performance of RPSG against DP-SGD (…
Figure 2
Figure 2: Efficiency comparison on Reddit for generating 1,000 synthetic samples with no DP (ϵ = ∞). [Plot labels: Successful Extraction Rate (%); DeepSeek-R1, GPT-2, GPT-3.5, GPT-4o-mini, Phi-4.]
Figure 5
Figure 5: Effect of synthetic sample size and temperature on next-word prediction. [Plot labels: x-axis Synthetic Sample Size (0 to 2000); y-axis FID; one curve per temperature in {0.2, 0.5, 0.8, 1.0, 1.2}.]
Original abstract

Large language models (LLMs) have emerged as a powerful tool for synthetic data generation. A particularly important use case is producing synthetic replicas of private text, which requires carefully balancing privacy and utility. We propose Realistic and Privacy-Preserving Synthetic Data Generation (RPSG), which uses private seeds and integrates privacy-preserving strategies, including a formal differential privacy (DP) mechanism in the candidate selection, to generate realistic synthetic data. Comprehensive experiments against state-of-the-art private synthetic data generation methods demonstrate that RPSG achieves high fidelity to private data while providing strong privacy protection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript proposes Realistic and Privacy-Preserving Synthetic Data Generation (RPSG), a method for creating synthetic replicas of private text data. It uses private seeds to guide public large language models (LLMs) and incorporates a formal differential privacy (DP) mechanism during candidate selection to balance fidelity and privacy. The central claim is that comprehensive experiments demonstrate RPSG achieves higher fidelity to the private data than state-of-the-art private synthetic data methods while providing strong privacy protection.

Significance. If the experimental results hold, the work could be significant for privacy-preserving data synthesis in sensitive domains. The approach of anchoring generation with private seeds while leveraging public LLMs and adding DP in candidate selection offers a practical alternative to training fully private models, potentially improving scalability and accessibility. The explicit use of a formal DP mechanism is a strength that distinguishes it from heuristic privacy approaches.

minor comments (2)
  1. The abstract asserts 'comprehensive experiments' and 'superior performance' without naming the baseline methods, reporting any quantitative metrics (e.g., fidelity scores, privacy budgets), or specifying the DP parameters such as epsilon. Adding these details would strengthen the abstract's ability to convey the claims.
  2. The description of the candidate selection step would benefit from brief, high-level pseudocode or a diagram in the methods section showing where the formal DP mechanism is applied. Lower-level implementation details could be moved to an appendix.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their review and for summarizing our work on RPSG. We appreciate the recognition of the approach's potential significance as a practical alternative for privacy-preserving synthetic data generation that leverages public LLMs with private seeds and formal DP. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper proposes RPSG as a practical method that combines private seeds, public LLMs, and a formal DP mechanism during candidate selection, then validates the approach via experiments against existing baselines. No derivation chain, equations, or first-principles claims appear in the abstract or method description that reduce to self-definition, fitted inputs renamed as predictions, or load-bearing self-citations. The central claims rest on empirical fidelity and privacy measurements rather than tautological constructions, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based solely on abstract; no explicit free parameters, axioms, or invented entities are stated or derivable from the provided text.

pith-pipeline@v0.9.0 · 5385 in / 1128 out tokens · 38736 ms · 2026-05-10T17:24:21.107244+00:00 · methodology

