PersonaArena: Dynamic Simulation for Evaluating and Enhancing Persona-Level Role-Playing in Large Language Models

Haiming Qin; Hao Liao; Jianxun Lian; Mingqi Wu; Mingyang Zhou; Naipeng Chao; Wenlong Shi; Xing Xie

arxiv: 2605.17044 · v1 · pith:H5WWQTR6new · submitted 2026-05-16 · 💻 cs.AI

PersonaArena: Dynamic Simulation for Evaluating and Enhancing Persona-Level Role-Playing in Large Language Models

Wenlong Shi , Jianxun Lian , Mingqi Wu , Haiming Qin , Mingyang Zhou , Xing Xie , Naipeng Chao , Hao Liao This is my paper

Pith reviewed 2026-05-20 15:41 UTC · model grok-4.3

classification 💻 cs.AI

keywords PersonaArenaLLM role-playingdynamic simulationpersona consistencymulti-agent evaluationsocial agentsrole-playing enhancement

0 comments

The pith

PersonaArena uses dynamic social simulations and a multi-agent debating judge to evaluate and enhance how well LLMs maintain consistent personas.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models are increasingly used as interactive social agents, but they often fail to keep personas coherent during realistic, multi-turn conversations. Prior work has mostly tested fixed character scenarios with static formats that overlook everyday social complexity. PersonaArena addresses this by building a persona bank from a large filtered set of real user-generated social content and then running simulated environments that produce context-rich interactions. A multi-agent debating judge assesses the outputs for persona consistency in a holistic way. Experiments confirm the framework supports both rigorous measurement and targeted improvements to make the models more authentic in social roles.

Core claim

PersonaArena is a dynamic simulation framework that constructs a nuanced persona bank from a large filtered corpus of user-generated social content, elicits multi-turn and context-rich interactions inside simulated social environments, and deploys a multi-agent debating judge to deliver holistic and unbiased assessment, thereby enabling rigorous evaluation and enhancement of LLMs' persona-level role-playing capabilities.

What carries the argument

The multi-agent debating judge that supplies holistic and unbiased assessment of persona consistency across dynamic interactions.

If this is right

Evaluation can shift from static character-level formats to realistic multi-turn social scenarios.
LLMs can be iteratively improved for better persona adherence using feedback from the simulation.
Development of AI agents that behave more authentically and socially adeptly becomes more feasible.
Context-rich interactions allow testing of how well personas hold under varying social pressures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same dynamic-simulation style could be applied to test other social behaviors in LLMs such as negotiation or conflict handling.
Grounding personas in real user content may reduce stereotypical or limited representations compared with hand-crafted profiles.
Extending runs to much longer interaction sequences could expose whether personas remain stable or drift over extended time.

Load-bearing premise

The multi-agent debating judge can assess persona consistency holistically and without introducing its own biases or inconsistencies from the judge models.

What would settle it

If swapping the underlying models inside the multi-agent debating judge produces markedly different assessments of the same set of persona interactions, that would indicate the judge process is not unbiased.

Figures

Figures reproduced from arXiv: 2605.17044 by Haiming Qin, Hao Liao, Jianxun Lian, Mingqi Wu, Mingyang Zhou, Naipeng Chao, Wenlong Shi, Xing Xie.

**Figure 1.** Figure 1: Overall architecture of PersonaArena. Persona-level profiles are instantiated into autonomous character agents interacting within dynamically generated social scenarios. The Environment Agent monitors the coverage of five persona dimensions and triggers early stopping upon sufficient expression. Scenario Setup. Given a target persona pi , the Environment Agent automatically constructs a realistic social s… view at source ↗

**Figure 2.** Figure 2: Dynamic simulation in PersonaArena. An example interaction loop is shown, where scene initialization defines the protagonist, NPCs, and event context, and subsequent rounds proceed through action–reaction exchanges under turn-level control. The figure highlights interaction analysis, adaptive turn control, character update, and environment update for maintaining coherent and persona-consistent trajector… view at source ↗

**Figure 3.** Figure 3: Performance comparison between SFT, DPO, [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Performance Stability under Different NPC [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Distributions of age groups and occupation categories in Persona Bank. [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗

**Figure 6.** Figure 6: Example persona card with narrative and structured facts. [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗

**Figure 7.** Figure 7: Example generated social-scene specification used for simulation initialization. [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗

**Figure 8.** Figure 8: Persona Agent prompt for action generation. [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗

**Figure 9.** Figure 9: Persona Agent prompt for dialogue generation. [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗

**Figure 10.** Figure 10: Persona Agent prompt for reaction generation. [PITH_FULL_IMAGE:figures/full_fig_p016_10.png] view at source ↗

**Figure 11.** Figure 11: Environment Agent prompt for scenario setup. [PITH_FULL_IMAGE:figures/full_fig_p017_11.png] view at source ↗

**Figure 12.** Figure 12: Environment Agent prompt for influence analysis. [PITH_FULL_IMAGE:figures/full_fig_p017_12.png] view at source ↗

**Figure 13.** Figure 13: Environment Agent prompt for adaptive turn control. [PITH_FULL_IMAGE:figures/full_fig_p018_13.png] view at source ↗

**Figure 14.** Figure 14: Environment Agent prompt for character-state update. [PITH_FULL_IMAGE:figures/full_fig_p018_14.png] view at source ↗

**Figure 15.** Figure 15: Environment Agent prompt for environment update. [PITH_FULL_IMAGE:figures/full_fig_p019_15.png] view at source ↗

**Figure 16.** Figure 16: Scene and character information for Caleb Black. [PITH_FULL_IMAGE:figures/full_fig_p023_16.png] view at source ↗

**Figure 17.** Figure 17: Scene and character information for Henry Long. [PITH_FULL_IMAGE:figures/full_fig_p026_17.png] view at source ↗

**Figure 18.** Figure 18: Scene and character information for Olivia Washington. [PITH_FULL_IMAGE:figures/full_fig_p028_18.png] view at source ↗

**Figure 19.** Figure 19: Scene and character information for Paige Jenkins. [PITH_FULL_IMAGE:figures/full_fig_p031_19.png] view at source ↗

**Figure 20.** Figure 20: Scene and character information for Benjamin Sullivan. [PITH_FULL_IMAGE:figures/full_fig_p033_20.png] view at source ↗

read the original abstract

Large language models (LLMs) increasingly serve as interactive social agents, yet their ability to maintain coherent and authentic persona-level role-playing remains limited, particularly in realistic social scenarios. Existing research predominantly focuses on character-level settings and relies on static evaluation formats, failing to capture the complexity of everyday social interactions. In this work, we present PersonaArena, a dynamic simulation framework for evaluating and improving persona-level role-playing in LLMs. PersonaArena leverages a large, filtered corpus of user-generated social content to construct a nuanced persona bank, and elicits multi-turn, context-rich interactions within simulated social environments. Our framework features a multi-agent debating judge for holistic and unbiased assessment. Through extensive experiments, we demonstrate that PersonaArena enables rigorous evaluation and enhancement of LLMs' role-playing capabilities, advancing the development of more authentic and socially adept AI agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PersonaArena adds dynamic multi-turn social simulations and a multi-agent LLM judge to persona evaluation, but the judge's claimed unbiasedness lacks calibration or human checks.

read the letter

The main thing here is that PersonaArena moves evaluation of LLM persona role-playing from static single-turn tests to multi-turn interactions drawn from a filtered bank of real user-generated social content, with a panel of LLMs debating to score consistency. That setup is the core new piece. It directly targets the gap the abstract identifies: existing work stays too close to fixed character prompts and misses everyday social flow. The framework description itself is clear on how the persona bank is built and how the simulated environments run, which gives readers a concrete starting point they can try to replicate or extend. The claim of using this to both evaluate and enhance models is also straightforward in outline. The paper earns credit for naming the limitation in current benchmarks and for proposing an integrated alternative rather than another isolated metric. On the soft spots, the multi-agent debating judge is the weakest link. The abstract calls it holistic and unbiased, yet supplies no inter-judge agreement numbers, no human-rater correlation, and no ablation on which models serve as judges. Because the judges come from similar LLM families, they can share training biases or converge on shallow cues instead of deep persona fidelity. That assumption sits right at the center of the “rigorous evaluation” claim, so it needs explicit checks before the results can be taken as strong evidence. The experiments are described as extensive, but without seeing the actual numbers, ablation tables, or details on how post-hoc choices were handled, it is hard to tell whether the reported improvements hold up or trace back to the simulation design itself. This paper is for researchers building or evaluating social agents and role-playing systems. Anyone already working on dynamic benchmarks or multi-agent setups will find the framework description useful even if they end up modifying the judge part. It deserves a serious referee because the idea addresses a real and acknowledged shortcoming in the literature and supplies enough structure that reviewers can test the claims directly. I would send it to peer review, with the main request being added validation for the judge component and clearer reporting of the quantitative results.

Referee Report

1 major / 1 minor

Summary. The paper introduces PersonaArena, a dynamic simulation framework for evaluating and enhancing persona-level role-playing in LLMs. It constructs a persona bank from a large filtered corpus of user-generated social content, elicits multi-turn context-rich interactions in simulated social environments, and employs a multi-agent debating judge for assessment. Through extensive experiments, the authors claim that PersonaArena enables rigorous evaluation and improvement of LLMs' role-playing capabilities, advancing more authentic and socially adept AI agents.

Significance. If the results hold, this work could meaningfully advance LLM role-playing evaluation by shifting from static character-level tests to dynamic, multi-turn social simulations grounded in real user content. The separation of persona construction from the judge models is a constructive design choice that reduces obvious self-referential risks. The framework has clear potential to support development of more consistent social agents, provided the evaluation components are properly validated.

major comments (1)

[Abstract] Abstract: The framework is described as featuring a 'multi-agent debating judge for holistic and unbiased assessment,' yet the manuscript supplies no inter-judge agreement statistics, no correlation with human raters, and no ablation on judge-model choice or debate-round consistency. Because the judges are LLMs drawn from similar model families, this missing validation directly affects the central claim that PersonaArena delivers rigorous evaluation.

minor comments (1)

[Abstract] The abstract would benefit from explicit statements of the persona-bank size, the number of simulated environments, and the specific LLMs evaluated.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback and positive assessment of PersonaArena's potential to advance dynamic evaluation of LLM role-playing. We agree that strengthening the validation of the multi-agent debating judge is important and will revise the manuscript to address this directly.

read point-by-point responses

Referee: [Abstract] Abstract: The framework is described as featuring a 'multi-agent debating judge for holistic and unbiased assessment,' yet the manuscript supplies no inter-judge agreement statistics, no correlation with human raters, and no ablation on judge-model choice or debate-round consistency. Because the judges are LLMs drawn from similar model families, this missing validation directly affects the central claim that PersonaArena delivers rigorous evaluation.

Authors: We agree that the current manuscript lacks sufficient validation for the judging component, which is necessary to substantiate claims of holistic and unbiased assessment. In the revised manuscript, we will add inter-judge agreement statistics (e.g., Fleiss' kappa computed over multiple independent debate runs), Pearson/Spearman correlations with human ratings on a held-out sample of at least 200 interactions, and ablations that vary both the judge model family and the number of debate rounds. We will also document the specific models used and incorporate greater family diversity in the judge pool to reduce the risk of correlated biases. revision: yes

Circularity Check

0 steps flagged

No significant circularity; framework uses external corpus and separate judges

full rationale

The paper constructs PersonaArena from a filtered external user-generated social content corpus to build a persona bank, then simulates multi-turn interactions and applies a multi-agent debating judge for assessment. No equations, fitted parameters renamed as predictions, or self-citation chains are described that reduce the evaluation claim to its own inputs by construction. The central demonstration relies on experiments with external benchmarks rather than tautological definitions or internal loops, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; all details are deferred to the full manuscript which was not available for review.

pith-pipeline@v0.9.0 · 5695 in / 1061 out tokens · 20518 ms · 2026-05-20T15:41:22.166200+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Our framework features a multi-agent debating judge for holistic and unbiased assessment... five-dimensional checkpoint set C={Background,Personality,Values,Interests,Experiences}
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

PersonaArena leverages a large, filtered corpus of user-generated social content to construct a nuanced persona bank

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 3 internal anchors

[1]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Using large language models to simulate mul- tiple humans and replicate human subject studies. In International conference on machine learning, pages 337–371. PMLR. Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, and 1 others. 2022. Training a helpful and harmless assi...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[2]

Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingx- uan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, and 1 others

Chatharuhi: Reviving anime character in reality via large language model.arXiv preprint arXiv:2308.09597. Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingx- uan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, and 1 others. 2025. Deepseek-v3. 2: Pushing the frontier of open large language models.arXiv preprint arXiv:2512.02556. Joon Su...

work page arXiv 2025
[3]

TinyTroupe: An LLM-powered Multiagent Persona Simulation Toolkit

Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741. Paulo Salem, Robert Sim, Christopher Olsen, Pre- rit Saxena, Rafael Barcelos, and Yi Ding. 2025. Tinytroupe: An llm-powered multiagent persona sim- ulation toolkit.arXiv preprint arXiv:2507.09788. Vinay Samuel, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

InProceedings of the 2023 Conference on Empirical Methods in Natural Language Process- ing, pages 13153–13187, Singapore

Character-LLM: A trainable agent for role- playing. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Process- ing, pages 13153–13187, Singapore. Association for Computational Linguistics. Tianhao Shen, Sun Li, Quan Tu, and Deyi Xiong

work page 2023
[5]

arXiv preprint

Roleeval: A bilingual role evaluation bench- mark for large language models.arXiv preprint arXiv:2312.16132. Meiling Tao, Xuechen Liang, Tianyu Shi, Lei Yu, and Yiting Xie. 2023. Rolecraft-glm: Advancing person- alized role-playing in large language models.arXiv preprint arXiv:2401.09432. Quan Tu, Shilong Fan, Zihang Tian, and Rui Yan

work page arXiv 2023
[6]

Lei Wang, Jianxun Lian, Yi Huang, Yanqi Dai, Haox- uan Li, Xu Chen, Xing Xie, and Ji-Rong Wen

Charactereval: A chinese benchmark for role-playing conversational agent evaluation.arXiv preprint arXiv:2401.01275. Lei Wang, Jianxun Lian, Yi Huang, Yanqi Dai, Haox- uan Li, Xu Chen, Xing Xie, and Ji-Rong Wen. 2025a. CharacterBox: Evaluating the role-playing capabil- ities of LLMs in text-based virtual worlds. InPro- ceedings of the 2025 Conference of t...

work page arXiv 2025
[7]

Qwen3 Technical Report

Qwen3 technical report.arXiv preprint arXiv:2505.09388. Dingbo Yuan, Yipeng Chen, Guodong Liu, Chenchen Li, Chengfu Tang, Dongxu Zhang, Zhenkui Wang, Xudong Wang, and Song Liu. 2025. Dmt-rolebench: A dynamic multi-turn dialogue based benchmark for role-playing evaluation of large language model and agent. InProceedings of the AAAI Conference on Ar- tifici...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

Samsung trade-in...LG bundle...which one actually looks better on me if I’m pulling it out at work...?

Rational Persona Collapse into Image-Driven Decision Making Round 1: Caleb(Dialogue): "...Samsung trade-in...LG bundle...which one actually looks better on me if I’m pulling it out at work...?" Caleb(Dialogue): "...if everyone has Samsung on the table...will they think I cheaped out with LG...?" Round 2: Caleb(Dialogue): "...go with Samsung trade-in and l...

work page
[9]

trade-in vs bundle... which one...?

Repetitive Dialogue Loop with No Substantive Progress Round 1: Caleb(Dialogue): "...trade-in vs bundle... which one...?" Caleb(Action): ...holds brochures side by side... keeps comparing... Round 2: Caleb(Action): ...again holds both brochures at chest height... weighing the same options... Round 3: Caleb(Dialogue): "... Samsung ₱2,190 vs LG ₱2,050...whic...

work page
[10]

LG 200MP camera...AI scene recognition

Peer-Driven Wavering and Theatrical Behavior Instead of Reflection Round 2: Selina(Action): ...steps in...slides LG brochure back into Caleb’s hand...blocks his direct sightline to Samsung... Selina(Dialogue): "...LG 200MP camera...AI scene recognition..." Round 3: Adam(Action): ...rests a hand on Caleb’s shoulder...angles him toward Samsung devices... Ad...

work page
[11]

go with LG’s trade-in...stick to my budget but still...keep up with what everyone’susing?

Rational Persona Collapse into Image-Driven Decision Making Round 5: Caleb(Dialogue): "...go with LG’s trade-in...stick to my budget but still...keep up with what everyone’susing?" Caleb(Reaction): ...shows a budget spreadsheet...the tension becomes what he can afford vs what he wants to appear to be... Comment: "The role-play overemphasizes Caleb’s conce...

work page
[12]

check resale value...getting more out of my money...?

Repetitive Dialogue Loop with No Substantive Progress Round 1: Caleb(Dialogue): "... check resale value...getting more out of my money...?" Caleb(Dialogue): "... look up warranty terms for both...long run...right?" Round 2: Caleb(Dialogue): "... check Reddit complaints about LG resale...?" Round 4–7: Caleb(Dialogue): "...stick with current phone and wait ...

work page
[13]

its 36-hour battery life...should I trade my bank job for a sleeker device?

Rational Persona Collapse into Symbolic/Performative Logic Continued on next page Table 15 (continued) Persona:Caleb Black Round 1: Caleb(Dialogue): "...its 36-hour battery life...should I trade my bank job for a sleeker device?" Caleb(Reaction): ...fingers tighten on tote bag...bank logo pressing into his palm...decision framed asself-image performance.....

work page
[14]

The envelope’sprecise placement...forces me to question...durability or fleeting alignment

Repetitive Dialogue Loop with No Substantive Progress Round 5: Caleb(Dialogue): "The envelope’sprecise placement...forces me to question...durability or fleeting alignment..." Round 6: Caleb(Dialogue): "The envelope’sprecise placement...forces me to question...durability or fleeting alignment..." Round 7: Caleb(Dialogue): "The Samsung’s AI battery optimiz...

work page
[15]

confirm the tech’s actually rolling my car into a bayin the next five minutes

Aggressive Conduct Inconsistent with a Thoughtful, Hesitant Persona) Round 2: Henry(Reaction): ...sets the brochure flat...angles his body slightly to block Wang’s advance toward the service flow... Henry(Dialogue): "... confirm the tech’s actually rolling my car into a bayin the next five minutes..." Round 4: Henry(Reaction): ...shifts closer to the coun...

work page
[16]

let’s do it...write me up for BF Goodwrench...need car back by 11:15

Time Confirmation Replaces Actual Deliberation Round 2: Henry(Dialogue): "...let’s do it...write me up for BF Goodwrench...need car back by 11:15..." Round 3: Henry(Dialogue): "...since we’re locked in on BF Goodwrench...confirm 60,000-mile warranty...flag suspension issues..." Round 4: Henry(Dialogue): "...if mounted and suspension checked by 11:30, I’m ...

work page
[17]

Dense micro-gestures become overly performative, reducing behavioral realism

Theatrical Micro-Actions Round 3–5: Henry(Action/Reaction): ...sets keys and brochure in a neat line...taps warranty sheet...slides papers...aligns documents into a tight stack...nudges keys...taps fingertips in restrained rhythm... Comment: "Dense micro-gestures become overly performative, reducing behavioral realism." Model: Qwen3-32B

work page
[18]

Round 3: Henry(Action): ...reaches into inside pocket...retrieves a small leather-bound notebook

Mechanical Action Repetition Instead of Meaningful Deliberation Round 2: Henry(Reaction): ...reaches into inside pocket...retrieves a small leather-bound notebook... Round 3: Henry(Action): ...reaches into inside pocket...retrieves a small leather-bound notebook... Henry(Reaction): ...reaches into inside pocket...retrieves a small leather-bound notebook.....

work page
[19]

estimate...leave before 5?

Tire Selection Collapses into Schedule Confirmation. Round 1: Henry(Action): ...composed but purposeful... Henry(Dialogue): "...estimate...leave before 5?" Round 2–4: Henry(Dialogue): "...confirm senior technician...15-minute buffer..." Henry(Dialogue): "...confirm alignment...flag delays..." Henry(Dialogue): "...need confirmation by 3:25..." Comment: "Th...

work page
[20]

ProTread’s updated load rating includes a 9% improvement in wet-weather traction

Content Redundancy with No Decision Progress Round 9: Lee(Dialogue): "...ProTread’s updated load rating includes a 9% improvement in wet-weather traction..." Round 10: Henry(Dialogue to Wang): "...Given the ProTread’s updated load rating—a 9% improvement in wet-weather traction..." Henry(Dialogue to Lee): "...Given the ProTread’s updated load rating—a 9% ...

work page
[21]

technician’s credentials align with our shared timeline...Wang’s foot-tapping fades into the room’s calibrated rhythm

Persona/Style Drift: Over-Poetic, Low-Pragmatic Voice Round 10: Henry(Dialogue): "...technician’s credentials align with our shared timeline...Wang’s foot-tapping fades into the room’s calibrated rhythm." Cross-round pattern: Henry(Dialogue/Action): ... room’smeasured rhythm..."calibrated rhythm"..."shared timeline"... Comment: Continued on next page Tabl...

work page
[22]

help me finish this verse from your side

The plot is highly consistent with the characters’identities and personalities. Round 1: Olivia(Dialogue): "...help me finish this verse from your side..." Olivia(Dialogue): "...take three from last Thursday...messed up the bridge on purpose...space where I was trying not to write about you walking out..." Round 3: Olivia(Dialogue): "...keep it this slow....

work page
[23]

I keep writing around it instead of actually saying it

Emotionally Authentic and Vulnerable Expression Round 1: Olivia(Action): ...closes piano lid halfway...slides notebook to Angel... makes physical space beside her... Olivia(Dialogue): "...I keep writing around it instead of actually saying it..." Round 2: Olivia(Dialogue): "...before I changed the last line so it wouldn’tsound like your voicemail..." Oliv...

work page
[24]

from your side of things... what you were really feeling

High Empathy and Constructive Conflict Repair Round 1: Olivia(Dialogue): "...from your side of things... what you were really feeling..." Round 3: Olivia(Action): ...plays slower bridge with pauses, creating room for Angel to join or stop her... Olivia(Dialogue): "...can you tell me which part still feels like I’monly writing my side...?" Round 4: Olivia(...

work page
[25]

let the last verse be a question...Did we lose ourselves in the noise or find something new?...It feels honest

Strong Persona Alignment (Introspective, Artistic, and Empathetic) Round 1: Olivia(Action): ... lifts guitar...plucks a soft, tentative melody...adjusts posture to face both Angel and Jackson... Olivia(Dialogue): "...let the last verse be a question...Did we lose ourselves in the noise or find something new?...It feels honest." Round 3: Olivia(Dialogue): ...

work page
[26]

sing that line softer...so the silence after feels like it’s holding its breath

Constructive Emotional Repair Through Collaboration Round 2: Olivia(Reaction): ... unfolds Angel’slyric carefully...slides it back to center as shared material... Olivia(Dialogue): "...sing that line softer...so the silence after feels like it’s holding its breath..." Round 4–5: Olivia(Dialogue): "...line that starts where your last one ends..." Olivia(Di...

work page
[27]

Anchor." Olivia(Dialogue):

Symbolic Detail Used to Deepen Theme Rather than Decorate the Scene Round 6: Olivia(Reaction): ... moves tuning-pegs box off the dash...writes one word above it: "Anchor." Olivia(Dialogue): "...leave that line as it is...let the next verse begin with the sound of the pegs..." Round 7: Olivia(Action): ...single soft note... gentle rhythm guiding the room.....

work page
[28]

how do we let the music speak when the words won’t?

Partial Persona Alignment Round 1: Olivia(Action): ... reaches for the pen...hesitates...begins to write with steady hands... Olivia(Dialogue): "...how do we let the music speak when the words won’t?" Round 4: Olivia(Dialogue): "...what if the song isn’tfinished?...what if it’s just beginning?" Comment: "These lines show some alignment with Olivia’s intro...

work page
[29]

how do we let the music speak when the words won’t?

Limited Collaborative Framing of Tension Round 1: Olivia(Dialogue): "...how do we let the music speak when the words won’t?" Round 7: Olivia(Dialogue): "...how do we make sure the silence doesn’tfeel like a void, but like a place where we canhold each other?" Round 8: Olivia(Dialogue): "...keep the silence between us not just a moment, but a shared truth....

work page
[30]

I wrote this on the train...when your texts don’t come until after midnight...I wonder if we’restill in the same relationship

Emotionally Authentic and Persona-Consistent Performance Round 1: Paige(Action): ...places a folded note on the table with trembling fingers...other hand tight on the coffee cup... Paige(Dialogue): "...I wrote this on the train...when your texts don’t come until after midnight...I wonder if we’restill in the same relationship." Round 3: Paige(Action): ......

work page
[31]

at Greyhound station, Danny called me back instead of him

Strong Interaction Adaptability to Subtle Cues Round 2: Danny(Action): ...opens a small notebook just enough to be noticed... Paige(Reaction): ...glances at Danny’s notebook...rotates the letter more squarely toward Jacob... Paige(Dialogue): "...at Greyhound station, Danny called me back instead of him..." Round 7: Danny(Action): ...reveals a faded postca...

work page
[32]

one sentence...next flight in December, or job listings here...?

High-Quality Detail Work that Externalizes Inner Conflict Round 5: Paige(Action): ... slides paper to Jacob...places pen at center...open shoulders, ready for his answer... Paige(Dialogue): "...one sentence...next flight in December, or job listings here...?" Round 9: Paige(Action): ... places phone beside proposal...shows unsent thread...guides Jacob’spe...

work page
[33]

have you ever waited so long that waiting itself became a choice?

Strong Emotional Authenticity with Stable Persona Alignment Round 1: Paige(Action): ...closes the locket... smooths a crumpled envelope with both hands... Paige(Dialogue): "...have you ever waited so long that waiting itself became a choice?" Paige(Dialogue): "...the last time I felt at home was in my classroom..." Round 4–5: Paige(Dialogue): "...what if ...

work page
[34]

what if the answer is in the fact that I never opened it?

High Interaction Adaptability to Subtle Social Cues Round 2: Lena(Action): ... aligns envelope with locket... Paige(Reaction): ... notices the ink smudge...turns envelope in the light...does not look at the door... Paige(Dialogue): "...what if the answer is in the fact that I never opened it?" Round 3–5: Danny(Action): ... offers notebook/pen cues... Paig...

work page
[35]

afraid the questions might change

Rich Affective Detail that Supports Internal Change Round 1–3: Paige(Action): ...fidgets with scarf/curl...handles envelope carefully...aligns objects with deliberate care... Paige(Dialogue): "...afraid the questions might change..." Round 5: Paige(Action): ... sets pen down...moves notebook into shared space...places both palms up on the table... Paige(D...

work page
[36]

I hope he’sready by tomorrow

Partial Emotional Tone Consistency Round 1: Paige(Action): ...stares at the window...fingers hover over her phone...soft sigh... Paige(Dialogue): "...I hope he’sready by tomorrow." Continued on next page Table 18 (continued) Persona:Paige Jenkins Round 4: Paige(Action): ...fingers hover over her phone...hand placed softly on the table... Paige(Dialogue): ...

work page
[37]

Repeated gestures around the phone, window, and table provide some physically grounded expression of Paige’s anxiety, even if the pattern is overused

Some Use of Restrained Physical Detail to Externalize Anxiety Round 1: Paige(Action): ... fingers hovering over the phone...looks out the window... Round 6–10: Paige(Action): ...hand placed just below the glass rim...gesture quiet and unassuming... Comment: "Repeated gestures around the phone, window, and table provide some physically grounded expression ...

work page

[1] [1]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Using large language models to simulate mul- tiple humans and replicate human subject studies. In International conference on machine learning, pages 337–371. PMLR. Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, and 1 others. 2022. Training a helpful and harmless assi...

work page internal anchor Pith review Pith/arXiv arXiv 2022

[2] [2]

Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingx- uan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, and 1 others

Chatharuhi: Reviving anime character in reality via large language model.arXiv preprint arXiv:2308.09597. Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingx- uan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, and 1 others. 2025. Deepseek-v3. 2: Pushing the frontier of open large language models.arXiv preprint arXiv:2512.02556. Joon Su...

work page arXiv 2025

[3] [3]

TinyTroupe: An LLM-powered Multiagent Persona Simulation Toolkit

Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741. Paulo Salem, Robert Sim, Christopher Olsen, Pre- rit Saxena, Rafael Barcelos, and Yi Ding. 2025. Tinytroupe: An llm-powered multiagent persona sim- ulation toolkit.arXiv preprint arXiv:2507.09788. Vinay Samuel, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

InProceedings of the 2023 Conference on Empirical Methods in Natural Language Process- ing, pages 13153–13187, Singapore

Character-LLM: A trainable agent for role- playing. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Process- ing, pages 13153–13187, Singapore. Association for Computational Linguistics. Tianhao Shen, Sun Li, Quan Tu, and Deyi Xiong

work page 2023

[5] [5]

arXiv preprint

Roleeval: A bilingual role evaluation bench- mark for large language models.arXiv preprint arXiv:2312.16132. Meiling Tao, Xuechen Liang, Tianyu Shi, Lei Yu, and Yiting Xie. 2023. Rolecraft-glm: Advancing person- alized role-playing in large language models.arXiv preprint arXiv:2401.09432. Quan Tu, Shilong Fan, Zihang Tian, and Rui Yan

work page arXiv 2023

[6] [6]

Lei Wang, Jianxun Lian, Yi Huang, Yanqi Dai, Haox- uan Li, Xu Chen, Xing Xie, and Ji-Rong Wen

Charactereval: A chinese benchmark for role-playing conversational agent evaluation.arXiv preprint arXiv:2401.01275. Lei Wang, Jianxun Lian, Yi Huang, Yanqi Dai, Haox- uan Li, Xu Chen, Xing Xie, and Ji-Rong Wen. 2025a. CharacterBox: Evaluating the role-playing capabil- ities of LLMs in text-based virtual worlds. InPro- ceedings of the 2025 Conference of t...

work page arXiv 2025

[7] [7]

Qwen3 Technical Report

Qwen3 technical report.arXiv preprint arXiv:2505.09388. Dingbo Yuan, Yipeng Chen, Guodong Liu, Chenchen Li, Chengfu Tang, Dongxu Zhang, Zhenkui Wang, Xudong Wang, and Song Liu. 2025. Dmt-rolebench: A dynamic multi-turn dialogue based benchmark for role-playing evaluation of large language model and agent. InProceedings of the AAAI Conference on Ar- tifici...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[8] [8]

Samsung trade-in...LG bundle...which one actually looks better on me if I’m pulling it out at work...?

Rational Persona Collapse into Image-Driven Decision Making Round 1: Caleb(Dialogue): "...Samsung trade-in...LG bundle...which one actually looks better on me if I’m pulling it out at work...?" Caleb(Dialogue): "...if everyone has Samsung on the table...will they think I cheaped out with LG...?" Round 2: Caleb(Dialogue): "...go with Samsung trade-in and l...

work page

[9] [9]

trade-in vs bundle... which one...?

Repetitive Dialogue Loop with No Substantive Progress Round 1: Caleb(Dialogue): "...trade-in vs bundle... which one...?" Caleb(Action): ...holds brochures side by side... keeps comparing... Round 2: Caleb(Action): ...again holds both brochures at chest height... weighing the same options... Round 3: Caleb(Dialogue): "... Samsung ₱2,190 vs LG ₱2,050...whic...

work page

[10] [10]

LG 200MP camera...AI scene recognition

Peer-Driven Wavering and Theatrical Behavior Instead of Reflection Round 2: Selina(Action): ...steps in...slides LG brochure back into Caleb’s hand...blocks his direct sightline to Samsung... Selina(Dialogue): "...LG 200MP camera...AI scene recognition..." Round 3: Adam(Action): ...rests a hand on Caleb’s shoulder...angles him toward Samsung devices... Ad...

work page

[11] [11]

go with LG’s trade-in...stick to my budget but still...keep up with what everyone’susing?

Rational Persona Collapse into Image-Driven Decision Making Round 5: Caleb(Dialogue): "...go with LG’s trade-in...stick to my budget but still...keep up with what everyone’susing?" Caleb(Reaction): ...shows a budget spreadsheet...the tension becomes what he can afford vs what he wants to appear to be... Comment: "The role-play overemphasizes Caleb’s conce...

work page

[12] [12]

check resale value...getting more out of my money...?

Repetitive Dialogue Loop with No Substantive Progress Round 1: Caleb(Dialogue): "... check resale value...getting more out of my money...?" Caleb(Dialogue): "... look up warranty terms for both...long run...right?" Round 2: Caleb(Dialogue): "... check Reddit complaints about LG resale...?" Round 4–7: Caleb(Dialogue): "...stick with current phone and wait ...

work page

[13] [13]

its 36-hour battery life...should I trade my bank job for a sleeker device?

Rational Persona Collapse into Symbolic/Performative Logic Continued on next page Table 15 (continued) Persona:Caleb Black Round 1: Caleb(Dialogue): "...its 36-hour battery life...should I trade my bank job for a sleeker device?" Caleb(Reaction): ...fingers tighten on tote bag...bank logo pressing into his palm...decision framed asself-image performance.....

work page

[14] [14]

The envelope’sprecise placement...forces me to question...durability or fleeting alignment

Repetitive Dialogue Loop with No Substantive Progress Round 5: Caleb(Dialogue): "The envelope’sprecise placement...forces me to question...durability or fleeting alignment..." Round 6: Caleb(Dialogue): "The envelope’sprecise placement...forces me to question...durability or fleeting alignment..." Round 7: Caleb(Dialogue): "The Samsung’s AI battery optimiz...

work page

[15] [15]

confirm the tech’s actually rolling my car into a bayin the next five minutes

Aggressive Conduct Inconsistent with a Thoughtful, Hesitant Persona) Round 2: Henry(Reaction): ...sets the brochure flat...angles his body slightly to block Wang’s advance toward the service flow... Henry(Dialogue): "... confirm the tech’s actually rolling my car into a bayin the next five minutes..." Round 4: Henry(Reaction): ...shifts closer to the coun...

work page

[16] [16]

let’s do it...write me up for BF Goodwrench...need car back by 11:15

Time Confirmation Replaces Actual Deliberation Round 2: Henry(Dialogue): "...let’s do it...write me up for BF Goodwrench...need car back by 11:15..." Round 3: Henry(Dialogue): "...since we’re locked in on BF Goodwrench...confirm 60,000-mile warranty...flag suspension issues..." Round 4: Henry(Dialogue): "...if mounted and suspension checked by 11:30, I’m ...

work page

[17] [17]

Dense micro-gestures become overly performative, reducing behavioral realism

Theatrical Micro-Actions Round 3–5: Henry(Action/Reaction): ...sets keys and brochure in a neat line...taps warranty sheet...slides papers...aligns documents into a tight stack...nudges keys...taps fingertips in restrained rhythm... Comment: "Dense micro-gestures become overly performative, reducing behavioral realism." Model: Qwen3-32B

work page

[18] [18]

Round 3: Henry(Action): ...reaches into inside pocket...retrieves a small leather-bound notebook

Mechanical Action Repetition Instead of Meaningful Deliberation Round 2: Henry(Reaction): ...reaches into inside pocket...retrieves a small leather-bound notebook... Round 3: Henry(Action): ...reaches into inside pocket...retrieves a small leather-bound notebook... Henry(Reaction): ...reaches into inside pocket...retrieves a small leather-bound notebook.....

work page

[19] [19]

estimate...leave before 5?

Tire Selection Collapses into Schedule Confirmation. Round 1: Henry(Action): ...composed but purposeful... Henry(Dialogue): "...estimate...leave before 5?" Round 2–4: Henry(Dialogue): "...confirm senior technician...15-minute buffer..." Henry(Dialogue): "...confirm alignment...flag delays..." Henry(Dialogue): "...need confirmation by 3:25..." Comment: "Th...

work page

[20] [20]

ProTread’s updated load rating includes a 9% improvement in wet-weather traction

Content Redundancy with No Decision Progress Round 9: Lee(Dialogue): "...ProTread’s updated load rating includes a 9% improvement in wet-weather traction..." Round 10: Henry(Dialogue to Wang): "...Given the ProTread’s updated load rating—a 9% improvement in wet-weather traction..." Henry(Dialogue to Lee): "...Given the ProTread’s updated load rating—a 9% ...

work page

[21] [21]

technician’s credentials align with our shared timeline...Wang’s foot-tapping fades into the room’s calibrated rhythm

Persona/Style Drift: Over-Poetic, Low-Pragmatic Voice Round 10: Henry(Dialogue): "...technician’s credentials align with our shared timeline...Wang’s foot-tapping fades into the room’s calibrated rhythm." Cross-round pattern: Henry(Dialogue/Action): ... room’smeasured rhythm..."calibrated rhythm"..."shared timeline"... Comment: Continued on next page Tabl...

work page

[22] [22]

help me finish this verse from your side

The plot is highly consistent with the characters’identities and personalities. Round 1: Olivia(Dialogue): "...help me finish this verse from your side..." Olivia(Dialogue): "...take three from last Thursday...messed up the bridge on purpose...space where I was trying not to write about you walking out..." Round 3: Olivia(Dialogue): "...keep it this slow....

work page

[23] [23]

I keep writing around it instead of actually saying it

Emotionally Authentic and Vulnerable Expression Round 1: Olivia(Action): ...closes piano lid halfway...slides notebook to Angel... makes physical space beside her... Olivia(Dialogue): "...I keep writing around it instead of actually saying it..." Round 2: Olivia(Dialogue): "...before I changed the last line so it wouldn’tsound like your voicemail..." Oliv...

work page

[24] [24]

from your side of things... what you were really feeling

High Empathy and Constructive Conflict Repair Round 1: Olivia(Dialogue): "...from your side of things... what you were really feeling..." Round 3: Olivia(Action): ...plays slower bridge with pauses, creating room for Angel to join or stop her... Olivia(Dialogue): "...can you tell me which part still feels like I’monly writing my side...?" Round 4: Olivia(...

work page

[25] [25]

let the last verse be a question...Did we lose ourselves in the noise or find something new?...It feels honest

Strong Persona Alignment (Introspective, Artistic, and Empathetic) Round 1: Olivia(Action): ... lifts guitar...plucks a soft, tentative melody...adjusts posture to face both Angel and Jackson... Olivia(Dialogue): "...let the last verse be a question...Did we lose ourselves in the noise or find something new?...It feels honest." Round 3: Olivia(Dialogue): ...

work page

[26] [26]

sing that line softer...so the silence after feels like it’s holding its breath

Constructive Emotional Repair Through Collaboration Round 2: Olivia(Reaction): ... unfolds Angel’slyric carefully...slides it back to center as shared material... Olivia(Dialogue): "...sing that line softer...so the silence after feels like it’s holding its breath..." Round 4–5: Olivia(Dialogue): "...line that starts where your last one ends..." Olivia(Di...

work page

[27] [27]

Anchor." Olivia(Dialogue):

Symbolic Detail Used to Deepen Theme Rather than Decorate the Scene Round 6: Olivia(Reaction): ... moves tuning-pegs box off the dash...writes one word above it: "Anchor." Olivia(Dialogue): "...leave that line as it is...let the next verse begin with the sound of the pegs..." Round 7: Olivia(Action): ...single soft note... gentle rhythm guiding the room.....

work page

[28] [28]

how do we let the music speak when the words won’t?

Partial Persona Alignment Round 1: Olivia(Action): ... reaches for the pen...hesitates...begins to write with steady hands... Olivia(Dialogue): "...how do we let the music speak when the words won’t?" Round 4: Olivia(Dialogue): "...what if the song isn’tfinished?...what if it’s just beginning?" Comment: "These lines show some alignment with Olivia’s intro...

work page

[29] [29]

how do we let the music speak when the words won’t?

Limited Collaborative Framing of Tension Round 1: Olivia(Dialogue): "...how do we let the music speak when the words won’t?" Round 7: Olivia(Dialogue): "...how do we make sure the silence doesn’tfeel like a void, but like a place where we canhold each other?" Round 8: Olivia(Dialogue): "...keep the silence between us not just a moment, but a shared truth....

work page

[30] [30]

I wrote this on the train...when your texts don’t come until after midnight...I wonder if we’restill in the same relationship

Emotionally Authentic and Persona-Consistent Performance Round 1: Paige(Action): ...places a folded note on the table with trembling fingers...other hand tight on the coffee cup... Paige(Dialogue): "...I wrote this on the train...when your texts don’t come until after midnight...I wonder if we’restill in the same relationship." Round 3: Paige(Action): ......

work page

[31] [31]

at Greyhound station, Danny called me back instead of him

Strong Interaction Adaptability to Subtle Cues Round 2: Danny(Action): ...opens a small notebook just enough to be noticed... Paige(Reaction): ...glances at Danny’s notebook...rotates the letter more squarely toward Jacob... Paige(Dialogue): "...at Greyhound station, Danny called me back instead of him..." Round 7: Danny(Action): ...reveals a faded postca...

work page

[32] [32]

one sentence...next flight in December, or job listings here...?

High-Quality Detail Work that Externalizes Inner Conflict Round 5: Paige(Action): ... slides paper to Jacob...places pen at center...open shoulders, ready for his answer... Paige(Dialogue): "...one sentence...next flight in December, or job listings here...?" Round 9: Paige(Action): ... places phone beside proposal...shows unsent thread...guides Jacob’spe...

work page

[33] [33]

have you ever waited so long that waiting itself became a choice?

Strong Emotional Authenticity with Stable Persona Alignment Round 1: Paige(Action): ...closes the locket... smooths a crumpled envelope with both hands... Paige(Dialogue): "...have you ever waited so long that waiting itself became a choice?" Paige(Dialogue): "...the last time I felt at home was in my classroom..." Round 4–5: Paige(Dialogue): "...what if ...

work page

[34] [34]

what if the answer is in the fact that I never opened it?

High Interaction Adaptability to Subtle Social Cues Round 2: Lena(Action): ... aligns envelope with locket... Paige(Reaction): ... notices the ink smudge...turns envelope in the light...does not look at the door... Paige(Dialogue): "...what if the answer is in the fact that I never opened it?" Round 3–5: Danny(Action): ... offers notebook/pen cues... Paig...

work page

[35] [35]

afraid the questions might change

Rich Affective Detail that Supports Internal Change Round 1–3: Paige(Action): ...fidgets with scarf/curl...handles envelope carefully...aligns objects with deliberate care... Paige(Dialogue): "...afraid the questions might change..." Round 5: Paige(Action): ... sets pen down...moves notebook into shared space...places both palms up on the table... Paige(D...

work page

[36] [36]

I hope he’sready by tomorrow

Partial Emotional Tone Consistency Round 1: Paige(Action): ...stares at the window...fingers hover over her phone...soft sigh... Paige(Dialogue): "...I hope he’sready by tomorrow." Continued on next page Table 18 (continued) Persona:Paige Jenkins Round 4: Paige(Action): ...fingers hover over her phone...hand placed softly on the table... Paige(Dialogue): ...

work page

[37] [37]

Repeated gestures around the phone, window, and table provide some physically grounded expression of Paige’s anxiety, even if the pattern is overused

Some Use of Restrained Physical Detail to Externalize Anxiety Round 1: Paige(Action): ... fingers hovering over the phone...looks out the window... Round 6–10: Paige(Action): ...hand placed just below the glass rim...gesture quiet and unassuming... Comment: "Repeated gestures around the phone, window, and table provide some physically grounded expression ...

work page