pith. machine review for the scientific record.

arxiv: 2605.12894 · v1 · submitted 2026-05-13 · 💻 cs.AI · cs.CL

Recognition: 2 theorem links


Beyond Cooperative Simulators: Generating Realistic User Personas for Robust Evaluation of LLM Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:21 UTC · model grok-4.3

classification: 💻 cs.AI · cs.CL
keywords: LLM agents · user simulators · persona generation · evolutionary search · robust evaluation · behavioral variation · simulator training

The pith

Evolving Python generators for user personas yields simulators that train LLM agents to succeed 17 percent more often with uncooperative users.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current LLM user simulators stay too cooperative and uniform, so agents that pass in simulation still fail when real users are unclear, impatient, or reluctant. The paper shows that casting persona generation as an evolutionary search over Python programs, guided by a multi-objective score for human-likeness and behavioral coverage, produces varied yet task-preserving simulators. Agents trained on the resulting personas handle out-of-distribution behaviors better and achieve 17 percent higher task success than agents trained only on existing simulators. The method works across retail and airline domains without changing the underlying tasks or rewards.

Core claim

Persona Policies (PPol) is a plug-and-play control layer that uses LLM-driven evolutionary program search to optimize a Python persona generator. The search discovers behaviors and translates them into roleplay policies that preserve the original task goals, while a multi-objective fitness score drives both human-likeness and broad behavioral coverage. This yields 33-62 percent absolute fitness gains, 80.4 percent human ratings in blinded tests, and 17 percent higher agent success on challenging interactions.

What carries the argument

LLM-driven evolutionary program search that optimizes a Python generator for personas using a multi-objective fitness score of human-likeness plus coverage of human behavioral patterns.
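Concretely, the search loop reads roughly as the sketch below. This is a minimal reconstruction, not the authors' code: llm_mutate stands in for the LLM rewrite step, fitness for the multi-objective score, and the tournament selection and population truncation are assumptions (the citation of OpenEvolve [21] suggests a search of this style).

    import random

    def evolve_persona_generator(seed_program, llm_mutate, fitness,
                                 generations=50, population_size=8):
        """Return the highest-fitness persona-generator program found."""
        population = [(seed_program, fitness(seed_program))]
        for _ in range(generations):
            # Tournament selection: sample a few candidates, keep the fittest.
            parent, parent_score = max(
                random.sample(population, min(3, len(population))),
                key=lambda pair: pair[1])
            # Ask the LLM to rewrite the Python generator, guided by
            # reflection feedback on its scored rollouts (cf. Figure 1).
            child = llm_mutate(parent, feedback=f"fitness={parent_score:.3f}")
            population.append((child, fitness(child)))
            # Truncate to the fittest programs so the population stays bounded.
            population.sort(key=lambda pair: pair[1], reverse=True)
            population = population[:population_size]
        return population[0][0]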

If this is right

  • Agents trained with PPol improve task success by 17 percent relative to training only on existing simulated interactions.
  • PPol yields 33-62 percent absolute gains in fitness score over baseline simulators in retail and airline domains.
  • Annotators rate PPol-conditioned users as human 80.4 percent of the time, nearly twice as often as baseline simulators.
  • Once optimized, the generator produces a diverse population of human-like personas for any task in the domain.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same evolutionary search could be applied to create simulators for other interactive systems such as customer-service bots or tutoring agents.
  • Measuring whether the evolved personas transfer to entirely new tasks would test how general the coverage of human patterns actually is.
  • Combining PPol with other training techniques could further reduce reliance on collecting large sets of real user interaction data.

Load-bearing premise

The multi-objective fitness score actually produces personas that match real human behavioral patterns rather than simply pleasing the LLM judge or the chosen coverage metrics.

What would settle it

A blinded test in a held-out domain in which agents trained with PPol interact with actual human users: if they show no gain in task success over agents trained only on baseline simulators, the central robustness claim fails.

Figures

Figures reproduced from arXiv: 2605.12894 by Aylin Caliskan, Chirag Shah, Harshita Chopra, Kshitish Ghate, Natasha Jaques, Tadayoshi Kohno.

Figure 1. Overview of the Persona Policies (PPol) program-evolution loop. Given a task, program G generates N persona policies that vary in how simulated users communicate. Candidate programs are scored via task rollouts for human-likeness and behavioral coverage, while reflection feedback judges the quality of trajectories to guide refinement of the axes, sampling rules, and prompts. Once optimized, the generator produces a diverse population of human-like personas for any task in the domain.

Figure 2. Samples of (A) discovered behavioral axes (which drive diversity in the population) and (B) personas.

Figure 3. Monitoring the combined (fitness) score and component metrics during program evolution.

Figure 4. PCA projection of humans, PPol, and base simulators in behavioral-fingerprint space. Domain: retail; base and PPol simulator: DeepSeek-V3.1. Analyzing the Sørensen–Dice alignment (D1–D4) clarifies where these gains originate: while the base simulator already exhibits moderate alignment on basic information disclosure (D2), it fails on interactional behaviors like clarification (D3) and error reaction (D4)… A sketch of the Dice alignment follows this list.

Figure 5. Annotator ratings evaluating the user side of conversations. "To validate that gains in behavioral fingerprint space correspond to human judgments, we ran a blinded evaluation on τ²-bench Retail conversations. We recruited 20 participants via Prolific [18] under an Institutional Review Board exemption. Each transcript paired the same assistant-agent (Gemma-4-31B) with one of three user sources: a real hu…"

Figure 6. Annotator ratings interface. The extracted caption also carries spilled appendix text (Appendix H, "Evolved Persona Generator Program") describing an optimized persona generator discovered during the evolutionary search of Section 3, which defines behavioral axes and uses them in the POPULATION_PROMPT and ROLEPLAY_PROMPT to govern the simulated user's behavior.
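A minimal sketch of the Sørensen–Dice alignment referenced in Figure 4, assuming each behavioral dimension (D1–D4) is encoded as a set of observed behavior labels; the paper's actual fingerprint encoding is not reproduced here, and the example labels are hypothetical.

    def dice(a, b):
        """Sørensen–Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
        if not a and not b:
            return 1.0
        return 2 * len(a & b) / (len(a) + len(b))

    # Hypothetical clarification behaviors (dimension D3) observed in
    # human traces versus a simulator's traces.
    human_d3 = {"asks_followup", "restates_goal", "requests_example"}
    sim_d3 = {"asks_followup"}
    print(dice(human_d3, sim_d3))  # 0.5: partial alignment on D3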
read the original abstract

Large Language Model (LLM) agents are increasingly deployed in settings where they interact with a wide variety of people, including users who are unclear, impatient, or reluctant to share information. However, collecting real interaction data at scale remains expensive. The field has turned to LLM-based user simulators as stand-ins, but these simulators inherit the behavior of their underlying models: cooperative and homogeneous. As a result, agents that appear strong in simulation often fail under the unseen, diverse communication patterns of real users. To narrow this gap, we introduce Persona Policies (PPol), a plug-and-play control layer that induces realistic behavioral variation in user simulators while preserving the original task goals. Rather than hand-crafting personas, we cast persona generation as an LLM-driven evolutionary program search that optimizes a Python generator to discover behaviors and translate them into task-preserving roleplay policies. Candidate generators are guided by a multi-objective fitness score combining human-likeness with broad coverage of human behavioral patterns. Once optimized, the generator produces a diverse population of human-like personas for any task in the domain. Across tau^2-bench retail and airline domains, evolved PPol programs yield 33-62% absolute gains in fitness score over the baseline simulator. In a blinded evaluation, annotators rated PPol-conditioned users as human 80.4% of the time, close to real human traces and nearly twice as frequently as baseline simulators. Agents trained with PPol are more robust to challenging, out-of-distribution behaviors, improving task success by +17% relative to training only on existing simulated interactions. This offers a novel approach to strengthen simulator-based evaluation and training without changing tasks or rewards.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom-and-free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces Persona Policies (PPol), a plug-and-play control layer that uses LLM-driven evolutionary program search to generate diverse, task-preserving user personas for simulators. It claims 33-62% absolute fitness gains over baseline simulators on tau^2-bench retail and airline domains, 80.4% human-likeness in blinded annotator ratings (nearly twice baseline), and +17% relative task success for agents trained on PPol-generated interactions versus existing simulated data.

Significance. If the generalization claims hold, PPol could meaningfully narrow the sim-to-real gap for LLM agent training and evaluation by inducing realistic behavioral variation without hand-crafted personas or new task definitions. The evolutionary search over Python generators is a technically interesting direction for scalable persona discovery.

major comments (3)
  1. [Abstract / §4] Abstract and §4 (Experiments): The +17% task success improvement is measured entirely within the tau^2-bench simulator ecosystem (PPol personas vs. baseline simulators); no held-out real-user interaction traces are used to measure agent success rates, leaving the central robustness-to-real-users claim unverified.
  2. [Abstract / §3] Abstract and §3 (Method): The multi-objective fitness score is described only at a high level (human-likeness + behavioral coverage); without the explicit definition of components, weights, or the precise procedure for computing human-likeness (LLM judge vs. human), it is impossible to assess whether the 33-62% gains reflect genuine behavioral diversity or optimization artifacts.
  3. [§4] §4: No mention of multiple random seeds or variance reporting for the evolutionary search; a single run leaves open whether the reported fitness and downstream +17% gains are stable or sensitive to initialization.
minor comments (1)
  1. [§3] Notation for the evolved generators and resulting personas is introduced without a clear running example; a concrete Python snippet in §3 would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. We address each major comment point by point below, providing clarifications and committing to specific revisions that strengthen the manuscript without altering its core claims.

read point-by-point responses
  1. Referee: [Abstract / §4] Abstract and §4 (Experiments): The +17% task success improvement is measured entirely within the tau^2-bench simulator ecosystem (PPol personas vs. baseline simulators); no held-out real-user interaction traces are used to measure agent success rates, leaving the central robustness-to-real-users claim unverified.

    Authors: We acknowledge that the +17% relative task success gain is measured by training and evaluating agents entirely within the tau^2-bench simulator using PPol-generated versus baseline interactions. This design isolates the effect of behavioral diversity while holding the underlying task and reward fixed. The blinded human evaluation (80.4% rated human, near real traces) and the 33-62% fitness gains provide supporting evidence that PPol personas better approximate real-user variation. We agree that direct evaluation on held-out real-user traces would offer stronger validation of sim-to-real robustness. In the revised manuscript we will add an explicit limitations paragraph in §4 and §5 clarifying that the reported gains constitute a controlled proxy evaluation and identifying real-user trace validation as important future work. revision: partial

  2. Referee: [Abstract / §3] Abstract and §3 (Method): The multi-objective fitness score is described only at a high level (human-likeness + behavioral coverage); without the explicit definition of components, weights, or the precise procedure for computing human-likeness (LLM judge vs. human), it is impossible to assess whether the 33-62% gains reflect genuine behavioral diversity or optimization artifacts.

    Authors: We agree that the current high-level description of the fitness function limits interpretability. In the revised §3 we will provide the complete mathematical definition of the multi-objective score, including: (i) the human-likeness term (LLM-as-judge prompt that compares generated dialogues against a held-out set of real human traces from the same domains), (ii) the behavioral coverage term (quantified via entropy over communication-style clusters and coverage of impatience/reluctance patterns), (iii) the explicit weighting (equal weights after normalization), and (iv) the aggregation and selection procedure used during evolutionary search. These additions will allow readers to verify that the reported gains arise from genuine diversity rather than optimization artifacts. (A sketch of the score as described here follows these responses.) revision: yes

  3. Referee: [§4] §4: No mention of multiple random seeds or variance reporting for the evolutionary search; a single run leaves open whether the reported fitness and downstream +17% gains are stable or sensitive to initialization.

    Authors: We concur that statistical robustness is essential for evolutionary methods. In the revised §4 we will rerun the evolutionary program search with five independent random seeds, reporting mean and standard deviation for both the final fitness scores (across retail and airline domains) and the downstream agent task-success rates. We will also include a brief sensitivity analysis showing that the top-performing programs from different seeds yield consistent +17% gains within one standard deviation, thereby demonstrating stability with respect to initialization. revision: yes
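To make the committed definition in response 2 concrete, here is a minimal sketch under the rebuttal's description: equal weights after normalization, an LLM-judge human-likeness term, and an entropy-based coverage term over communication-style clusters. judge_humanlikeness (assumed to return a score in [0, 1]) and the cluster assignments are hypothetical interfaces; the revised §3 is what would pin them down.

    import math
    from collections import Counter

    def coverage_entropy(style_clusters):
        """Normalized entropy over communication-style cluster assignments."""
        counts = Counter(style_clusters)
        n = len(style_clusters)
        h = -sum((c / n) * math.log(c / n) for c in counts.values())
        # Divide by log(k) so a uniform spread over k clusters scores 1.0.
        return h / math.log(len(counts)) if len(counts) > 1 else 0.0

    def fitness(dialogues, judge_humanlikeness, style_clusters):
        """Equal-weight sum of normalized human-likeness and coverage."""
        human_likeness = sum(judge_humanlikeness(d) for d in dialogues) / len(dialogues)
        return 0.5 * human_likeness + 0.5 * coverage_entropy(style_clusters)

Likewise, the seed-variance reporting committed to in response 3 amounts to something like the following, where run_search is a hypothetical wrapper around one full evolution run returning its final fitness:

    import statistics

    def report_across_seeds(run_search, seeds=(0, 1, 2, 3, 4)):
        """Return mean and standard deviation of fitness across seeds."""
        scores = [run_search(seed) for seed in seeds]
        return statistics.mean(scores), statistics.stdev(scores)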

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper describes casting persona generation as an LLM-driven evolutionary program search optimizing a multi-objective fitness score (human-likeness plus behavioral coverage), then reports resulting fitness gains of 33-62% and separate blinded annotator ratings of 80.4% human-likeness close to real traces. The +17% task success improvement is measured on a distinct agent training/evaluation metric within tau^2-bench. No equations, self-citations, or uniqueness theorems are present that would reduce any load-bearing claim to its own inputs by construction. The optimization objective and downstream empirical results (annotator judgments, task success) remain independent, making the overall derivation self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The approach rests on the assumption that LLMs can serve as effective evolutionary search operators and that a composite fitness score of human-likeness plus coverage produces generalizable personas. No explicit free parameters are named, but the fitness weights and search hyperparameters are implicit.

free parameters (1)
  • fitness score weights
    The relative weighting between human-likeness and behavioral coverage is not specified and must be chosen to produce the reported gains.
axioms (1)
  • domain assumption: LLM-based evolutionary search can discover task-preserving yet behaviorally diverse personas
    The method assumes the search process yields human-like variation without explicit proof or external validation beyond the reported annotator study.
invented entities (1)
  • Persona Policies (PPol): no independent evidence
    purpose: Plug-and-play control layer that induces realistic behavioral variation in user simulators
    New method introduced to address cooperative bias in existing simulators.

pith-pipeline@v0.9.0 · 5624 in / 1403 out tokens · 30117 ms · 2026-05-14T20:21:25.850294+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 1 internal anchor

  1. [1] Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan. τ²-bench: Evaluating conversational agents in a dual-control environment, 2025.

  2. [2] DeepSeek-AI. DeepSeek-V3 technical report, 2024.

  3. [3] Yao Dou, Michel Galley, Baolin Peng, Chris Kedzie, Weixin Cai, Alan Ritter, Chris Quirk, Wei Xu, and Jianfeng Gao. SimulatorArena: Are user simulators reliable proxies for multi-turn evaluation of AI assistants? In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 35200–35278, 2025.

  4. [4] Tao Ge, Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, and Dong Yu. Scaling synthetic data creation with 1,000,000,000 personas. arXiv preprint arXiv:2406.20094, 2024.

  5. [5] Muyu He, Anand Kumar, Tsach Mackey, Meghana Rajeev, James Zou, and Nazneen Rajani. Impatient users confuse AI agents: High-fidelity simulations of human traits for testing agents. arXiv preprint arXiv:2510.04491, 2025.

  6. [6] Christine Herlihy, Jennifer Neville, Tobias Schnabel, and Adith Swaminathan. On overcoming miscalibrated conversational priors in LLM-based chatbots. In Uncertainty in Artificial Intelligence, pages 1599–1620. PMLR, 2024.

  7. [7] Kimiya Keyvan and Jimmy Xiangji Huang. How to approach ambiguous queries in conversational search: A survey of techniques, approaches, tools, and challenges. ACM Computing Surveys, 55(6):1–40, 2022.

  8. [8] Chuyi Kong, Yaxin Fan, Xiang Wan, Feng Jiang, and Benyou Wang. PlatoLM: Teaching LLMs in multi-round dialogue via a user simulator. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7841–7863, 2024.

  9. [9] Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, and Jennifer Neville. LLMs get lost in multi-turn conversation. In Proceedings of the Fourteenth International Conference on Learning Representations (ICLR), 2026.

  10. [10] Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Haoping Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, et al. ToolSandbox: A stateful, conversational, interactive evaluation benchmark for LLM tool use capabilities. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 1160–1183, 2025.

  11. [11] Xiang Luo, Zhiwen Tang, Jin Wang, and Xuejie Zhang. DuetSim: Building user simulator with dual large language models for task-oriented dialogues. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 5414–5424, 2024.

  12. [12] Saif M. Mohammad and Peter D. Turney. Crowdsourcing a word–emotion association lexicon. Computational Intelligence, 29(3):436–465, 2013.

  13. [13] Jean-Baptiste Mouret and Jeff Clune. Illuminating search spaces by mapping elites. arXiv preprint arXiv:1504.04909, 2015.

  14. [14] Tarek Naous, Philippe Laban, Wei Xu, and Jennifer Neville. Flipping the dialogue: Training and evaluating user language models. arXiv preprint arXiv:2510.06552, 2025.

  15. [15] OpenAI. GPT-5.4 mini language model. Chat output, 2026. Generated on 2026-05-07.

  16. [16] Davide Paglieri, Logan Cross, William A. Cunningham, Joel Z. Leibo, and Alexander Sasha Vezhnevets. Persona generators: Generating diverse synthetic personas at scale. arXiv preprint arXiv:2602.03545, 2026.

  17. [17] James W. Pennebaker, Ryan L. Boyd, Kayla Jordan, and Kate G. Blackburn. The development and psychometric properties of LIWC2015. 2015.

  18. [18] Prolific. https://www.prolific.com, 2026. Online participant recruitment platform.

  19. [19] Ivan Sekulic, Silvia Terragni, Victor Guimarães, Nghia Khau, Bruna Guedes, Modestas Filipavicius, Andre Ferreira Manso, and Roland Mathis. Reliable LLM-based user simulator for task-oriented dialogue systems. In Yvette Graham, Qun Liu, Gerasimos Lampouras, Ignacio Iacobacci, Sinead Madden, Haider Khalid, and Rameez Qureshi, editors, Proceedings of the 1st ...

  20. [20] Preethi Seshadri, Samuel Cahyawijaya, Ayomide Odumakinde, Sameer Singh, and Seraphina Goldfarb-Tarrant. Lost in simulation: LLM-simulated users are unreliable proxies for human users in agentic evaluations. In Algorithmic Fairness Across Alignment Procedures and Agentic Systems, 2026.

  21. [21] Asankhaya Sharma. OpenEvolve: An open-source evolutionary coding agent, 2025.

  22. [22] Jeonghoon Shim, Woojung Song, Cheyon Jin, Seungwon Kook, and Yohan Jo. Non-collaborative user simulators for tool agents. In The Fourteenth International Conference on Learning Representations, 2026.

  23. [23] Qwen Team. Qwen3 technical report, 2025.

  24. [24] Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 23–30. IEEE, 2017.

  25. [25] Kuang Wang, Xianfei Li, Shenghao Yang, Li Zhou, Feng Jiang, and Haizhou Li. Know you first and be you better: Modeling human-like user simulators via implicit profiles. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: L...

  26. [26] Xingyao Wang, Zihan Wang, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng, and Heng Ji. MINT: Evaluating LLMs in multi-turn interaction with tools and language feedback. In The Twelfth International Conference on Learning Representations, 2024.

  27. [27] Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. Personalizing dialogue agents: I have a dog, do you have pets too? In Iryna Gurevych and Yusuke Miyao, editors, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2204–2213, Melbourne, Australia, July ...

  28. [28] Xuhui Zhou, Weiwei Sun, Qianou Ma, Yiqing Xie, Jiarui Liu, Weihua Du, Sean Welleck, Yiming Yang, Graham Neubig, Sherry Tongshuang Wu, and Maarten Sap. Mind the Sim2Real gap in user simulation for agentic tasks, 2026.

  29–32. [29]–[32] Not external works: these entries are fragments of the evolved persona-generator prompt from the paper's appendix, extracted as reference anchors. They instruct the roleplay policy to "GROUND the persona in this specific Task Context and behavior profile", to "specify concrete communication patterns that should be followed: linguistics, vocabulary, emotional markers, how they respond to agent requests", to "preserve all goals and facts from the Task Context; only vary *how* the person pursues them", and to never break character (no mention of "simulation", "benchmark", or "AI"). Fragment [32] also preserves the generator's public entrypoint, generate_personas_detailed(c: str, axes: List[Dict[str, Any]], n: int) -> List[Dict[str, Any]], documented as G(c, D, N), whose expanded_instruction for each persona is fed to the user simulator.
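Building on fragment [32], a minimal reconstruction of that entrypoint's interface follows. The body, the prompt template, and the axis format (name and values keys) are illustrative assumptions, not the discovered program.

    import random
    from typing import Any, Dict, List

    # Stand-in for the evolved ROLEPLAY_PROMPT; the real template is produced
    # by the search and includes the constraints quoted in [29]-[31].
    ROLEPLAY_PROMPT = ("GROUND the persona in the Task Context and behavior "
                       "profile; preserve all goals and facts, vary only *how* "
                       "the person pursues them; do NOT break character.")

    def generate_personas_detailed(c: str, axes: List[Dict[str, Any]],
                                   n: int) -> List[Dict[str, Any]]:
        """G(c, D, N): sample n personas for task context c over behavioral axes.

        The expanded_instruction of each persona is fed to the user simulator.
        """
        population = []
        for _ in range(n):
            # One value per discovered behavioral axis, e.g. patience: "low".
            profile = {axis["name"]: random.choice(axis["values"]) for axis in axes}
            population.append({
                "profile": profile,
                "expanded_instruction": (f"{ROLEPLAY_PROMPT}\nTask context: {c}\n"
                                         f"Behavior profile: {profile}"),
            })
        return population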