
arXiv: 2604.15461 · v1 · submitted 2026-04-16 · 💻 cs.LG · cs.CL · cs.CR

Evaluating LLM Simulators as Differentially Private Data Generators

Pith reviewed 2026-05-10 11:53 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL · cs.CR
keywords LLM simulators · differential privacy · synthetic data generation · distribution drift · fraud detection · agentic simulation · data privacy · PersonaLedger

The pith

LLM simulators seeded with differentially private personas achieve fraud detection utility but exhibit distribution drift from model biases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language model simulators can generate complex synthetic data while respecting differential privacy, a setting where traditional methods often fail on high-dimensional user profiles. It does this by feeding DP-protected synthetic personas derived from real financial statistics into PersonaLedger, an agentic simulator that produces transaction histories. The evaluation shows the simulator reaches an AUC of 0.70 at epsilon equal to 1 on a downstream fraud detection task, yet the outputs display marked drift away from the input distributions, especially in temporal patterns and demographic traits. This occurs because the LLM's pre-trained priors override the statistics supplied in the seeds. The work therefore positions LLM simulators as potentially powerful for rich data but identifies specific biases that must be fixed before they can faithfully reproduce protected inputs.

Core claim

PersonaLedger, when seeded with differentially private synthetic personas from real user statistics, generates transaction data that supports fraud detection with an AUC of 0.70 at epsilon=1; however, the resulting distributions diverge substantially from the seeds because the language model's learned priors override the provided statistics on temporal and demographic features.
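To make the utility figure concrete: AUC is the probability that a randomly chosen positive case (here, a high-risk user) receives a higher score than a randomly chosen negative one, so 0.5 is chance and 1.0 is perfect ranking. The sketch below computes it by direct pairwise comparison. The labels and scores are invented toy values, not data from the paper, and the paper's actual classifier and evaluation split are not described on this page.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = high-risk user, 0 = not high-risk.
toy_labels = [1, 1, 0, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.3, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)  # 5 of 6 pairs ranked correctly
```

The pairwise form is quadratic in the number of records, which is fine for illustration; a rank-based implementation would be preferable at scale.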

What carries the argument

PersonaLedger, an agentic financial simulator that uses LLM agents to expand seeded personas into synthetic transaction histories while attempting to respect the input statistics.
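One way the seeding step could work, sketched under stated assumptions: take a histogram of a real user attribute, add Laplace noise calibrated to the privacy budget, and sample synthetic personas from the noisy histogram. This page does not say which DP mechanism the authors used, so the Laplace mechanism, the function names, and the age counts below are all illustrative assumptions rather than the paper's pipeline.

```python
import random


def dp_histogram(counts, epsilon, rng):
    """Add Laplace(1/epsilon) noise to each bin count.

    Assumes each user falls in exactly one bin, so adding or removing one
    user changes the histogram by at most 1 in L1 norm (sensitivity 1).
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    scale = 1.0 / epsilon
    noisy = {}
    for key, count in counts.items():
        # The difference of two exponentials with mean `scale`
        # is distributed as Laplace(0, scale).
        noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        noisy[key] = count + noise
    return noisy


def to_probabilities(noisy):
    """Post-process noisy counts: clamp negatives to zero, then normalise."""
    clamped = {k: max(v, 0.0) for k, v in noisy.items()}
    total = sum(clamped.values())
    if total == 0:
        return {k: 1.0 / len(clamped) for k in clamped}
    return {k: v / total for k, v in clamped.items()}


def sample_personas(probs, n, rng):
    """Draw n synthetic attribute values from the noisy distribution."""
    keys = list(probs)
    return rng.choices(keys, weights=[probs[k] for k in keys], k=n)


rng = random.Random(0)
# Hypothetical age-band counts, not taken from the paper.
age_counts = {"18-29": 120, "30-44": 310, "45-59": 260, "60+": 110}
probs = to_probabilities(dp_histogram(age_counts, epsilon=1.0, rng=rng))
personas = sample_personas(probs, n=5, rng=rng)
```

Clamping, normalising, and sampling are post-processing and consume no additional privacy budget. Floating-point Laplace sampling of this kind is known to be unsuitable for production DP systems, so this is a teaching sketch only.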

If this is right

  • LLM simulators can deliver practical utility on downstream tasks like fraud detection even when initialized from privacy-protected seeds.
  • Systematic overrides by LLM priors produce drift concentrated in temporal and demographic variables.
  • These identified failure modes must be corrected before LLM simulators can be used for the richer, higher-dimensional user representations where they would otherwise have an advantage.
  • The approach complements traditional DP methods that struggle with complex profiles once bias mitigation is applied.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Prompt engineering or post-generation correction steps focused on temporal and demographic variables could be tested to reduce the observed overrides.
  • The same seeding and evaluation approach could be applied to other privacy-sensitive domains such as health records or consumer behavior modeling.
  • Direct head-to-head comparisons with purely statistical DP generators on identical datasets would quantify how much extra drift the LLM component introduces.
  • Scaling the method to even higher-dimensional profiles becomes feasible only if the bias override problem is solved first.
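As an illustration of the post-generation correction idea in the first bullet, the sketch below reweights simulator output so that one categorical marginal matches its target, then resamples. This is an editorial example, not something the paper proposes; the function names and the weekday/weekend shares are hypothetical.

```python
import random
from collections import Counter


def importance_weights(generated, target_probs):
    """Weight each record by (target share) / (observed share) of its category."""
    observed = Counter(generated)
    n = len(generated)
    weights = []
    for value in generated:
        observed_share = observed[value] / n
        weights.append(target_probs.get(value, 0.0) / observed_share)
    return weights


def resample_to_target(generated, target_probs, n_out, rng):
    """Resample records with replacement so the marginal follows the target."""
    weights = importance_weights(generated, target_probs)
    return rng.choices(generated, weights=weights, k=n_out)


# Hypothetical case: the simulator under-produces weekend transactions.
generated = ["weekday"] * 9 + ["weekend"] * 1
target = {"weekday": 5 / 7, "weekend": 2 / 7}
weights = importance_weights(generated, target)
corrected = resample_to_target(generated, target, n_out=1000, rng=random.Random(0))
```

Reweighting cannot create categories the simulator never produced, and heavy resampling duplicates records and shrinks the effective sample size, so it treats the symptom rather than the underlying prior.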

Load-bearing premise

That the differentially private synthetic personas derived from real user statistics provide a seed accurate enough for the LLM to reproduce the underlying distributions without its pre-trained priors dominating.

What would settle it

A direct statistical test comparing histograms of temporal features such as transaction timing and demographic features such as age in the simulator outputs against the same features in the DP input personas, with large significant differences confirming drift and close matches supporting faithful reproduction.
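A minimal sketch of such a test, assuming the feature has already been binned into categories (for example hour-of-day bands or age bands): measure the total variation distance between the two histograms and obtain a p-value by permutation. The choice of test statistic, the function names, and the sample data are assumptions made for illustration; a chi-square or Kolmogorov-Smirnov test would serve the same purpose.

```python
import random
from collections import Counter


def total_variation(sample_a, sample_b):
    """Half the L1 distance between two empirical category distributions."""
    count_a, count_b = Counter(sample_a), Counter(sample_b)
    categories = set(count_a) | set(count_b)
    return 0.5 * sum(
        abs(count_a[c] / len(sample_a) - count_b[c] / len(sample_b))
        for c in categories
    )


def permutation_p_value(sample_a, sample_b, n_permutations=2000, seed=0):
    """Return (observed distance, p-value) under the null of one shared distribution."""
    rng = random.Random(seed)
    observed = total_variation(sample_a, sample_b)
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        permuted = total_variation(pooled[:n_a], pooled[n_a:])
        if permuted >= observed - 1e-12:  # tolerance for float rounding
            at_least_as_extreme += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical binned transaction times: seed personas vs. simulator output.
seed_times = ["morning"] * 50 + ["evening"] * 50
sim_times = ["morning"] * 95 + ["evening"] * 5
distance, p_value = permutation_p_value(seed_times, sim_times)
```

A small p-value together with a large distance would indicate drift; a large p-value alone does not prove faithful reproduction, so the distance itself should be reported alongside it.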

Figures

Figures reproduced from arXiv: 2604.15461 by Dehao Yuan, Mayana Pereira, Nam H. Nguyen, Nassima M. Bouzid.

Figure 1: Experimental setup comparing two approaches: Profile-then-Simulate (left) aggregates …
Original abstract

LLM-based simulators offer a promising path for generating complex synthetic data where traditional differentially private (DP) methods struggle with high-dimensional user profiles. But can LLMs faithfully reproduce statistical distributions from DP-protected inputs? We evaluate this using PersonaLedger, an agentic financial simulator, seeded with DP synthetic personas derived from real user statistics. We find that PersonaLedger achieves promising fraud detection utility (AUC 0.70 at epsilon=1) but exhibits significant distribution drift due to systematic LLM biases--learned priors overriding input statistics for temporal and demographic features. These failure modes must be addressed before LLM-based methods can handle the richer user representations where they might otherwise excel.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript evaluates LLM-based simulators (PersonaLedger) as generators of differentially private synthetic financial user profiles. It seeds the simulator with DP synthetic personas derived from real statistics and reports that PersonaLedger achieves fraud-detection utility of AUC 0.70 at ε=1 while exhibiting significant distribution drift in temporal and demographic features, which the authors attribute to LLM pre-trained priors overriding the provided input statistics.

Significance. If the empirical findings are robust, the work supplies concrete evidence on the practical limitations of current LLM simulators for high-dimensional DP data synthesis. The reported utility metric and identification of specific drift modes (temporal/demographic) could usefully guide mitigation research in privacy-preserving data generation for domains such as finance.

major comments (2)
  1. [Results / Experimental Setup] The abstract and results sections report concrete metrics (AUC 0.70 at ε=1) and attribute observed distribution drift to LLM biases, yet provide no details on experimental setup, number of runs, error bars, baselines, or the precise quantification of drift. This absence makes it impossible to assess the statistical reliability or reproducibility of the central claims.
  2. [Distribution Drift Analysis] The claim that drift arises because 'learned priors override input statistics' is not isolated from the confounding effect of DP noise already present in the seeded personas. Without a control condition that seeds the simulator with the original non-DP statistics (or an ε=∞ baseline) and measures the incremental deviation attributable to the DP mechanism alone, the attribution to LLM priors rests on an untested assumption.
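The control the referee asks for can be sketched as a four-way comparison: the real statistics, the DP seed, the simulator output from the real (ε=∞) seed, and the simulator output from the DP seed. The snippet below reports each pairwise drift using total variation distance. The function names and all numbers are hypothetical, and the choice of distance is an assumption.

```python
def tv_distance(p, q):
    """Total variation distance between two categorical distributions (dicts)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def drift_report(real, dp_seed, out_from_real, out_from_dp):
    """Separate the drift attributable to DP noise from drift added by the simulator."""
    return {
        "dp_noise_only": tv_distance(real, dp_seed),
        "llm_only_control": tv_distance(real, out_from_real),
        "llm_given_dp_seed": tv_distance(dp_seed, out_from_dp),
        "end_to_end": tv_distance(real, out_from_dp),
    }


# Hypothetical weekday/weekend shares for one temporal feature.
report = drift_report(
    real={"weekday": 0.70, "weekend": 0.30},
    dp_seed={"weekday": 0.68, "weekend": 0.32},
    out_from_real={"weekday": 0.90, "weekend": 0.10},
    out_from_dp={"weekday": 0.89, "weekend": 0.11},
)
```

If `llm_only_control` is close to `end_to_end` while `dp_noise_only` is small, the drift is mostly the simulator's doing. These distances do not add up exactly; the triangle inequality only bounds the end-to-end drift by the sum of the two stages.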

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our evaluation of LLM simulators as differentially private data generators. We address each major comment point by point below, providing clarifications and indicating revisions made to the manuscript.

  1. Referee: [Results / Experimental Setup] The abstract and results sections report concrete metrics (AUC 0.70 at ε=1) and attribute observed distribution drift to LLM biases, yet provide no details on experimental setup, number of runs, error bars, baselines, or the precise quantification of drift. This absence makes it impossible to assess the statistical reliability or reproducibility of the central claims.

    Authors: We agree that the original submission provided insufficient methodological details. In the revised manuscript, we have added a comprehensive 'Experimental Setup' section specifying: five independent simulation runs using different random seeds to capture variability; error bars as standard deviations across runs for the reported AUC and other metrics; baselines including non-LLM DP synthesizers (DP-CTGAN and Gaussian copula) as well as a non-private LLM control; and precise drift quantification via Jensen-Shannon divergence for categorical features and Wasserstein distance for continuous ones, with tabulated values for temporal and demographic attributes. These changes support reproducibility and statistical evaluation of the claims. revision: yes

  2. Referee: [Distribution Drift Analysis] The claim that drift arises because 'learned priors override input statistics' is not isolated from the confounding effect of DP noise already present in the seeded personas. Without a control condition that seeds the simulator with the original non-DP statistics (or an ε=∞ baseline) and measures the incremental deviation attributable to the DP mechanism alone, the attribution to LLM priors rests on an untested assumption.

    Authors: The referee correctly notes the potential for confounding between DP noise and LLM priors. While the core experiments focus on the practical DP-input setting, we have added a new control experiment in the revised manuscript that seeds PersonaLedger with the original non-DP statistics (ε=∞ baseline). Comparative analysis shows that drift in temporal and demographic features persists at similar magnitudes in the non-DP case, indicating LLM priors as the primary driver, with DP noise contributing only marginal additional deviation in select features. The 'Distribution Drift Analysis' section has been updated with these results and the attribution language adjusted to reflect the evidence. revision: yes
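The drift metrics named in the simulated rebuttal can be sketched as follows: Jensen-Shannon divergence for categorical features, 1-Wasserstein distance for continuous ones, and a mean and standard deviation across runs. This is a standard-library illustration under simplifying assumptions (equal-size samples for the Wasserstein distance, hypothetical per-run values); it is not the authors' evaluation code.

```python
import math
import statistics


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two categorical distributions."""
    keys = set(p) | set(q)
    mix = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mix(a):
        # Terms with zero probability contribute nothing to the sum.
        return sum(
            a[k] * math.log2(a[k] / mix[k]) for k in a if a[k] > 0.0
        )

    return 0.5 * kl_to_mix(p) + 0.5 * kl_to_mix(q)


def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two equal-size one-dimensional samples."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("this sketch requires non-empty samples of equal size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def summarize_runs(values):
    """Mean and sample standard deviation across independent runs."""
    return statistics.mean(values), statistics.stdev(values)


# Hypothetical per-run divergences from five simulation seeds.
per_run_js = [0.10, 0.12, 0.11, 0.09, 0.13]
mean_js, std_js = summarize_runs(per_run_js)
```

With base-2 logarithms the Jensen-Shannon divergence lies between 0 (identical distributions) and 1 (disjoint support), which makes values comparable across features with different numbers of categories.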

Circularity Check

0 steps flagged

No circularity: empirical evaluation with external benchmarks


The paper is an empirical study evaluating PersonaLedger as a DP data generator, reporting measured AUC utility and observed distribution drift from experiments on seeded LLM simulations. No mathematical derivation chain exists that reduces predictions or claims to self-definitions, fitted parameters renamed as outputs, or load-bearing self-citations. Results depend on external data comparisons and experimental controls rather than internal constructions or ansatzes imported via prior author work. This matches the default expectation for non-circular empirical papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The evaluation rests on the assumption that DP personas faithfully encode real statistics and that fraud detection AUC is a valid proxy for simulator utility; no free parameters are explicitly fitted in the reported results, and no new entities are postulated.

axioms (1)
  • domain assumption: DP synthetic personas derived from real user statistics accurately represent the underlying distributions for seeding purposes.
    Invoked when the simulator is seeded with these personas to test reproduction of statistics.


