Evaluating LLM Simulators as Differentially Private Data Generators
Pith reviewed 2026-05-10 11:53 UTC · model grok-4.3
The pith
LLM simulators seeded with differentially private personas achieve fraud detection utility but exhibit distribution drift from model biases.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PersonaLedger, when seeded with differentially private synthetic personas derived from real user statistics, generates transaction data that supports fraud detection with an AUC of 0.70 at epsilon=1. However, the resulting distributions diverge substantially from the seeds, because the language model's learned priors override the provided statistics on temporal and demographic features.
What carries the argument
PersonaLedger, an agentic financial simulator that uses LLM agents to expand seeded personas into synthetic transaction histories while attempting to respect the input statistics.
If this is right
- LLM simulators can deliver practical utility on downstream tasks like fraud detection even when initialized from privacy-protected seeds.
- Systematic overrides by LLM priors produce drift concentrated in temporal and demographic variables.
- These identified failure modes must be corrected before LLM simulators can be used for the richer, higher-dimensional user representations where they would otherwise have an advantage.
- Once bias mitigation is applied, the approach could complement traditional DP methods, which struggle with complex profiles.
Where Pith is reading between the lines
- Prompt engineering or post-generation correction steps focused on temporal and demographic variables could be tested to reduce the observed overrides.
- The same seeding and evaluation approach could be applied to other privacy-sensitive domains such as health records or consumer behavior modeling.
- Direct head-to-head comparisons with purely statistical DP generators on identical datasets would quantify how much extra drift the LLM component introduces.
- Scaling the method to even higher-dimensional profiles becomes feasible only if the bias override problem is solved first.
Load-bearing premise
That the differentially private synthetic personas derived from real user statistics provide a seed accurate enough for the LLM to reproduce the underlying distributions without its pre-trained priors dominating.
What would settle it
A direct statistical test comparing histograms of temporal features such as transaction timing and demographic features such as age in the simulator outputs against the same features in the DP input personas, with large significant differences confirming drift and close matches supporting faithful reproduction.
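One way to run such a test is sketched below, under stated assumptions: it uses a permutation test on the total variation distance between two binned samples, which is a stand-in chosen here for simplicity and is not a method the paper is known to use. The function names, the age bins, and the counts are all hypothetical.

```python
import random
from collections import Counter


def total_variation(a, b, bins):
    """Total variation distance between the empirical distributions of a and b."""
    ca, cb = Counter(a), Counter(b)
    return 0.5 * sum(abs(ca[k] / len(a) - cb[k] / len(b)) for k in bins)


def permutation_test(seed_vals, sim_vals, n_perm=1000, rng_seed=0):
    """Return (observed TV distance, permutation p-value) for two binned samples."""
    rng = random.Random(rng_seed)
    bins = sorted(set(seed_vals) | set(sim_vals))
    observed = total_variation(seed_vals, sim_vals, bins)
    pooled = list(seed_vals) + list(sim_vals)
    n = len(seed_vals)
    hits = 0
    for _ in range(n_perm):
        # Under the null hypothesis both samples share one distribution,
        # so reshuffling the pooled labels should give similar distances.
        rng.shuffle(pooled)
        if total_variation(pooled[:n], pooled[n:], bins) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical binned ages (illustrative counts, not data from the paper).
seed_ages = ["18-29"] * 40 + ["30-44"] * 60 + ["45-64"] * 70 + ["65+"] * 30
sim_ages = ["18-29"] * 70 + ["30-44"] * 80 + ["45-64"] * 40 + ["65+"] * 10

tv, p = permutation_test(seed_ages, sim_ages)
print(f"TV distance = {tv:.3f}, permutation p-value = {p:.4f}")
```

A small p-value together with a large distance would point to drift; a distance near zero with a large p-value would be consistent with faithful reproduction of the seed histogram.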
Original abstract
LLM-based simulators offer a promising path for generating complex synthetic data where traditional differentially private (DP) methods struggle with high-dimensional user profiles. But can LLMs faithfully reproduce statistical distributions from DP-protected inputs? We evaluate this using PersonaLedger, an agentic financial simulator, seeded with DP synthetic personas derived from real user statistics. We find that PersonaLedger achieves promising fraud detection utility (AUC 0.70 at epsilon=1) but exhibits significant distribution drift due to systematic LLM biases--learned priors overriding input statistics for temporal and demographic features. These failure modes must be addressed before LLM-based methods can handle the richer user representations where they might otherwise excel.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates LLM-based simulators (PersonaLedger) as generators of differentially private synthetic financial user profiles. It seeds the simulator with DP synthetic personas derived from real statistics and reports that PersonaLedger achieves fraud-detection utility of AUC 0.70 at ε=1 while exhibiting significant distribution drift in temporal and demographic features, which the authors attribute to LLM pre-trained priors overriding the provided input statistics.
Significance. If the empirical findings are robust, the work supplies concrete evidence on the practical limitations of current LLM simulators for high-dimensional DP data synthesis. The reported utility metric and identification of specific drift modes (temporal/demographic) could usefully guide mitigation research in privacy-preserving data generation for domains such as finance.
Major comments (2)
- [Results / Experimental Setup] The abstract and results sections report concrete metrics (AUC 0.70 at ε=1) and attribute observed distribution drift to LLM biases, yet provide no details on experimental setup, number of runs, error bars, baselines, or the precise quantification of drift. This absence makes it impossible to assess the statistical reliability or reproducibility of the central claims.
- [Distribution Drift Analysis] The claim that drift arises because 'learned priors override input statistics' is not isolated from the confounding effect of DP noise already present in the seeded personas. Without a control condition that seeds the simulator with the original non-DP statistics (or an ε=∞ baseline) and measures the incremental deviation attributable to the DP mechanism alone, the attribution to LLM priors rests on an untested assumption.
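The control the referee asks for can be pictured as a four-way comparison. The sketch below is only an illustration under assumptions: it measures each gap with total variation distance (a choice made here, not taken from the paper), and every distribution in it is invented for the example rather than drawn from the manuscript.

```python
def tv_distance(p, q):
    """Total variation distance between two dicts mapping bin -> probability."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def decompose_drift(real, dp_seed, sim_from_real, sim_from_dp):
    """Separate the deviation added by DP noise from the deviation added by the simulator."""
    return {
        # Real statistics vs. DP seed: deviation from the DP mechanism alone.
        "dp_noise": tv_distance(real, dp_seed),
        # Real statistics vs. simulator seeded with them: the epsilon = infinity control.
        "llm_drift_no_dp": tv_distance(real, sim_from_real),
        # DP seed vs. simulator seeded with it: drift in the DP setting.
        "llm_drift_with_dp": tv_distance(dp_seed, sim_from_dp),
        # Real statistics vs. final output: what a downstream user experiences.
        "end_to_end": tv_distance(real, sim_from_dp),
    }


# Hypothetical time-of-day distributions (made up for illustration only).
real = {"night": 0.10, "morning": 0.30, "afternoon": 0.35, "evening": 0.25}
dp_seed = {"night": 0.12, "morning": 0.29, "afternoon": 0.34, "evening": 0.25}
sim_from_real = {"night": 0.02, "morning": 0.40, "afternoon": 0.40, "evening": 0.18}
sim_from_dp = {"night": 0.03, "morning": 0.40, "afternoon": 0.39, "evening": 0.18}

report = decompose_drift(real, dp_seed, sim_from_real, sim_from_dp)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

If `llm_drift_no_dp` is about as large as `llm_drift_with_dp` while `dp_noise` stays small, the attribution to LLM priors is supported; if the drift shrinks sharply in the control, DP noise is the more likely cause.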
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our evaluation of LLM simulators as differentially private data generators. We address each major comment point by point below, providing clarifications and indicating revisions made to the manuscript.
- Referee: [Results / Experimental Setup] The abstract and results sections report concrete metrics (AUC 0.70 at ε=1) and attribute observed distribution drift to LLM biases, yet provide no details on experimental setup, number of runs, error bars, baselines, or the precise quantification of drift. This absence makes it impossible to assess the statistical reliability or reproducibility of the central claims.
  Authors: We agree that the original submission provided insufficient methodological details. In the revised manuscript, we have added a comprehensive 'Experimental Setup' section specifying: five independent simulation runs using different random seeds to capture variability; error bars as standard deviations across runs for the reported AUC and other metrics; baselines including non-LLM DP synthesizers (DP-CTGAN and Gaussian copula) as well as a non-private LLM control; and precise drift quantification via Jensen-Shannon divergence for categorical features and Wasserstein distance for continuous ones, with tabulated values for temporal and demographic attributes. These changes support reproducibility and statistical evaluation of the claims.
  Revision: yes
- Referee: [Distribution Drift Analysis] The claim that drift arises because 'learned priors override input statistics' is not isolated from the confounding effect of DP noise already present in the seeded personas. Without a control condition that seeds the simulator with the original non-DP statistics (or an ε=∞ baseline) and measures the incremental deviation attributable to the DP mechanism alone, the attribution to LLM priors rests on an untested assumption.
  Authors: The referee correctly notes the potential for confounding between DP noise and LLM priors. While the core experiments focus on the practical DP-input setting, we have added a new control experiment in the revised manuscript that seeds PersonaLedger with the original non-DP statistics (ε=∞ baseline). Comparative analysis shows that drift in temporal and demographic features persists at similar magnitudes in the non-DP case, indicating LLM priors as the primary driver, with DP noise contributing only marginal additional deviation in select features. The 'Distribution Drift Analysis' section has been updated with these results and the attribution language adjusted to reflect the evidence.
  Revision: yes
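The drift metrics and error bars named in the rebuttal can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the authors' code: the function names are hypothetical, the Wasserstein routine assumes two samples of equal size, and the per-run values are invented to show the call pattern.

```python
import math
import statistics


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (range 0 to 1) between two dicts bin -> probability."""
    keys = set(p) | set(q)
    mix = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mix(a):
        # Terms with zero probability contribute nothing to the sum.
        return sum(a[k] * math.log2(a[k] / mix[k]) for k in a if a[k] > 0)

    return 0.5 * kl_to_mix(p) + 0.5 * kl_to_mix(q)


def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two one-dimensional samples of equal size."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    # For equal-sized samples the optimal coupling pairs the sorted values.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def mean_and_std(values):
    """Mean and sample standard deviation across independent runs."""
    return statistics.mean(values), statistics.stdev(values)


# Hypothetical per-run metric values for five seeds (not results from the paper).
runs = [0.68, 0.71, 0.70, 0.69, 0.72]
run_mean, run_std = mean_and_std(runs)
print(f"metric = {run_mean:.3f} +/- {run_std:.3f} over {len(runs)} runs")
```

Jensen-Shannon divergence suits categorical features because it stays finite when a bin is empty on one side, while the Wasserstein distance reports drift in the feature's own units, which makes shifts in amounts or timestamps easier to interpret.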
Circularity Check
No circularity: empirical evaluation with external benchmarks
The paper is an empirical study evaluating PersonaLedger as a DP data generator, reporting measured AUC utility and observed distribution drift from experiments on seeded LLM simulations. No mathematical derivation chain exists that reduces predictions or claims to self-definitions, fitted parameters renamed as outputs, or load-bearing self-citations. Results depend on external data comparisons and experimental controls rather than internal constructions or ansatzes imported via prior author work. This matches the default expectation for non-circular empirical papers.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: DP synthetic personas derived from real user statistics accurately represent the underlying distributions for seeding purposes.
Reference graph
Works this paper leans on
- [1] Georgi Ganev, Meenatchi Sundaram Muthu Selva Annamalai, Samir Mahiou, and Emiliano De Cristofaro. The importance of being discrete: Measuring the impact of discretization in end-to-end differentially private synthetic data. In Proceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security, CCS '25. doi: 10.1145/3719027.3765091.
- [2] Mikel Hernandez, Gorka Epelde, Ane Alberdi, Rodrigo Cilla, and Debbie Rankin. Standardised metrics and methods for synthetic tabular data evaluation. IEEE Transactions on Knowledge and Data Engineering.
- [3] Ryan McKenna, Gerome Miklau, and Daniel Sheldon. Winning the NIST contest: A scalable and general approach to differentially private synthetic data. Journal of Privacy and Confidentiality, 11(3). doi: 10.29012/jpc.778.
- [4] Vamsi K Potluru, Daniel Borrajo, Andrea Coletta, Niccolo Dalmasso, Youssef El-Laham, Eloy Fons, Marzyeh Ghassemi, Sriram Gopalakrishnan, Vedant Gosai, Emilija Kreacic, Gautam Mani, Samuel Obitayo, Deepak Paramanand, Nandini Raman, Mikhail Solonin, Siddharth Sood, Svitlana Vyetrenko, Haoyu Zhu, Manuela Veloso, and Tucker Balch. Syn...
- [5] Amy Steier, Lipika Ramaswamy, Andre Manoel, and Alexa Haushalter. Synthetic data privacy metrics. arXiv preprint arXiv:2501.03941, 2025.
- [6] Jun Zhang, Graham Cormode, Cecilia M Procopiuc, Divesh Srivastava, and Xiaokui Xiao. PrivBayes: Private data release via Bayesian networks. In Proceedings of the 2017 ACM International Conference on Management of Data, pp. 1423–1438.
- [7] Published as a conference paper at ICLR 2026. Appendix A, Data Schema. Table 2: Summary Statistics Schema: 27 features extracted from transaction logs.
  Column Name | Type | Bins | Description
  is high risk | binary | 2 | 1 if User is in top 15% of perc fraud txns
  age years | int | 5 | Current Age in years
  is retired | binary | 2 | 1 if Current Age ≥ Retirement Age
  gender | binary | 2 | 1 for female;...