Understanding Generalization in Role-Playing Models via Information Theory

Fei Huang; Hao Lang; Tieyun Qian; Yongbin Li; Yongqi Li

arXiv: 2512.17270 · v2 · submitted 2025-12-19 · 💻 cs.LG · cs.AI · cs.CL

Pith reviewed 2026-05-16 21:07 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords role-playing models · generalization · distribution shifts · information theory · mutual information · reinforcement learning · R-EMID

The pith

An information-theoretic metric called R-EMID quantifies how user, character, and dialogue shifts degrade role-playing model performance, with user shift proving most harmful.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Role-playing models lose effectiveness in real deployments because of distribution shifts in users, characters, and dialogue composition. The paper defines R-EMID as a reasoning-based effective mutual information difference that measures this degradation in an interpretable way. It derives a theoretical upper bound on R-EMID that predicts worst-case generalization loss and shows how each type of shift contributes to the drop. A co-evolving reinforcement learning procedure is introduced to model the joint dependencies among user, character, and context so that the probability of generating the correct response can be estimated more accurately for computing R-EMID. Experiments across models confirm that user shift produces the largest R-EMID increase while the reinforcement learning approach yields the strongest generalization gains.

Core claim

R-EMID measures the drop in effective mutual information between context and response when RPMs encounter user, character, or compositional shifts. An upper bound on R-EMID is derived that directly predicts the worst-case generalization performance without requiring exhaustive testing. The bound decomposes the total degradation into additive contributions from each shift type. A co-evolving reinforcement learning framework is then used to adaptively learn the joint distribution over user, character, and dialogue context, which improves the estimation of the conditional response probabilities required to compute R-EMID. Empirical evaluation shows that user shift produces the largest R-EMID value among the three shift types.

What carries the argument

R-EMID, the reasoning-based effective mutual information difference that compares mutual information between context and response under shifted versus original distributions to quantify generalization loss.
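
To make the "mutual information under shifted versus original distributions" idea concrete, here is a minimal sketch. It is not the paper's R-EMID estimator: it uses a plain plug-in estimate of mutual information on discrete toy samples, and the function names and data (`mutual_information`, `mi_difference`, `id_pairs`, `shifted_pairs`) are hypothetical.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Plug-in estimate of I(X; Y) in nats from a list of (x, y) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p(x, y) / (p(x) p(y)) simplifies to count * n / (count_x * count_y).
        mi += p_xy * math.log(count * n / (px[x] * py[y]))
    return mi

def mi_difference(id_pairs, shifted_pairs):
    """Drop in context-response mutual information from ID to shifted data."""
    return mutual_information(id_pairs) - mutual_information(shifted_pairs)

# Toy data (made up): in distribution, the response is fully determined by
# the context; under shift, context and response are independent.
id_pairs = [("greet", "hello"), ("bye", "farewell")] * 50
shifted_pairs = [
    ("greet", "hello"), ("greet", "farewell"),
    ("bye", "hello"), ("bye", "farewell"),
] * 25

gap = mi_difference(id_pairs, shifted_pairs)
print(f"mutual information drop: {gap:.4f} nats")
```

In this toy case the gap equals log 2, the full information content of a binary response. Real role-playing responses are free text, which is why the paper needs a learned estimate of response probabilities rather than count tables.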

If this is right

User shift produces the largest R-EMID increase and therefore the greatest generalization risk among the three shift types.
The derived upper bound on R-EMID supplies a computable worst-case performance guarantee for any RPM under a given collection of shifts.
Reinforcement learning yields larger reductions in R-EMID than supervised fine-tuning or other adaptation baselines.
Targeted mitigation of user shift alone can substantially lower overall generalization loss without addressing character or compositional shifts.
Improved estimation of response generation probability directly lowers the R-EMID value and thereby improves measured generalization.
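
The appendix excerpts preserved in the reference graph further down suggest that the bound involves a Jensen-Shannon divergence term between the original and shifted input distributions, apparently of the form sqrt(2 · D_JS(P_X ‖ Q_X)). As a hedged illustration of how such a term behaves, the sketch below computes it for discrete distributions. The distributions `p_id` and `q_shifted` are made up, and the way this term enters the paper's actual bound is not reproduced here.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def shift_term(p, q):
    """sqrt(2 * D_JS(p || q)); the max() guards against float round-off."""
    return math.sqrt(max(0.0, 2 * js_divergence(p, q)))

# Hypothetical distributions over three user types, before and after shift.
p_id = [0.7, 0.2, 0.1]
q_shifted = [0.2, 0.3, 0.5]

print(f"D_JS = {js_divergence(p_id, q_shifted):.4f} nats")
print(f"shift term = {shift_term(p_id, q_shifted):.4f}")
```

The term is zero when the two distributions coincide and grows as they separate, which matches the intuition that a larger input shift permits a larger worst-case degradation.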

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If R-EMID can be computed during training, it could serve as an auxiliary objective that directly penalizes sensitivity to user shift.
The decomposition of R-EMID by shift type suggests that separate adapters for user modeling could be combined with a single character model.
The same mutual-information approach may extend to other conditional generation tasks such as dialogue summarization or story continuation where distribution shift is common.
Online monitoring of R-EMID in deployed systems could trigger automatic retraining when user-shift contributions exceed a threshold.
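
The online-monitoring idea in the last item can be sketched as a rolling-average trigger. This is purely illustrative: the class name `ShiftMonitor`, the threshold, the window size, and the score stream are all assumptions, and the paper does not describe such a component.

```python
from collections import deque

class ShiftMonitor:
    """Flags retraining when the rolling mean of a shift score is too high."""

    def __init__(self, threshold, window=3):
        self.threshold = threshold
        self.history = deque(maxlen=window)

    def update(self, user_shift_score):
        """Record one score; return True once the full window mean exceeds the threshold."""
        self.history.append(user_shift_score)
        window_full = len(self.history) == self.history.maxlen
        mean = sum(self.history) / len(self.history)
        return window_full and mean > self.threshold

# Hypothetical stream of per-batch user-shift contributions.
monitor = ShiftMonitor(threshold=0.5, window=3)
flags = [monitor.update(score) for score in [0.1, 0.2, 0.9, 0.9, 0.9]]
print(flags)  # [False, False, False, True, True]
```

Averaging over a window rather than reacting to a single batch is a design choice made here to avoid retraining on one noisy estimate.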

Load-bearing premise

The co-evolving reinforcement learning framework correctly captures the joint dependencies among user, character, and dialogue context so that response generation probabilities can be estimated reliably enough to compute R-EMID.
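
The appendix excerpts preserved in the reference graph mention Group Relative Policy Optimization (GRPO) and a rollout count of 8 in one of the training setups. As background only, the sketch below shows the group-relative advantage normalization that standard GRPO applies to the rewards of one prompt's rollouts. The reward values are made up, the use of the population standard deviation is a choice made for this sketch, and nothing here reproduces the paper's reward design or its co-evolving loop.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own group (GRPO-style)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for 8 rollouts sampled from the same prompt.
rewards = [0.1, 0.4, 0.4, 0.9, 0.2, 0.7, 0.3, 0.6]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f}  advantage={a:+.3f}")
```

Rollouts that score above the group mean receive positive advantages and those below receive negative ones, so no separate value model is needed to provide a baseline.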

What would settle it

Deploying the co-evolving RL framework on held-out role-playing dialogues and finding that the measured R-EMID value does not decrease relative to a non-RL baseline while actual response quality still degrades under user shifts.

Figures

Figures reproduced from arXiv: 2512.17270 by Fei Huang, Hao Lang, Tieyun Qian, Yongbin Li, Yongqi Li.

**Figure 2.** Pilot study of RPM performance under various distribution shifts, including user shift (left), character ...

**Figure 3.** The proposed CoRL method for estimating R-EMID, which includes two modules for reasoning generation and conditional probability estimation, respectively. The two modules are optimized alternately during the co-evolution. In practice, both the reasoning mapping function f_R and the conditional distribution p(y|x, r) are inaccessible. Therefore, as shown in ...

**Figure 4.** Scatter plots with regression lines showing the correlations between information-theoretic metrics and ...

**Figure 5.** (a) Trend plot showing the correlation between ...

**Figure 6.** Performance of RPMs trained with different methods, including supervised fine-tuning (SFT), data ...

**Figure 7.** Maximum performance drop risk of RPMs trained with different methods. The x-axis denotes the shift ...

**Figure 8.** Overview of the RPGBench construction. First, we gather a diverse collection of agent characters ...

**Figure 9.** Pilot study of RPM performance under various distribution shifts, including user shift (left), character ...

**Figure 10.** Pilot study of RPM performance under various distribution shifts, including user shift (left), character ...

**Figure 11.** A case study comparing responses from different RPM training methods. This case is sampled from the ...

Original abstract

Role-playing models (RPMs) are widely used in real-world applications but underperform when deployed in the wild. This degradation can be attributed to distribution shifts, including user, character, and dialogue compositional shifts. Existing methods like LLM-as-a-judge fall short in providing a fine-grained diagnosis of how these shifts affect RPM generalization, and formal frameworks for characterizing RPM generalization behavior are still lacking. To bridge these gaps, we introduce an information-theoretic metric, named reasoning-based effective mutual information difference (R-EMID), to measure RPM performance degradation in an interpretable way. We also derive an upper bound on R-EMID to predict the worst-case generalization performance of RPMs and theoretically reveal how various shifts contribute to the RPM performance degradation. Moreover, we propose a co-evolving reinforcement learning framework to adaptively model the connection among user, character, and dialogue context and thus enhance the estimation of dialogue response generation probability, which is critical for calculating R-EMID. Finally, we evaluate the generalization performance of various RPMs using R-EMID, finding that user shift poses the highest risk among all shifts and reinforcement learning is the most effective approach for enhancing RPM generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

R-EMID gives a practical diagnostic for shift-induced failures in role-playing LLMs, with user shifts flagged as the worst and RL as the best fix, but the upper bound's independence from training needs verification.

The paper introduces R-EMID, an information-theoretic metric that quantifies performance drops in role-playing models under user, character, and dialogue shifts. It also derives an upper bound on this metric for worst-case prediction and adds a co-evolving RL framework to improve the response probability estimates that feed into the metric. The empirical takeaway is that user shift creates the largest risk and that RL mitigates it better than the alternatives. This matches what people see in deployed character systems, where unpredictable users break consistency faster than static character traits do. The framing is novel in that it applies mutual information differences specifically to RPM shifts rather than to generic distribution shift. The RL component is presented as a targeted way to model the joint user-character-context dependencies.

The main soft spot is that the abstract supplies no derivation steps or error analysis for the upper bound, so it is hard to judge whether the bound stays independent of quantities already fitted during training or simply restates them. If the bound turns out to be circular, the predictive claim weakens. The full text would need to show the math clearly and include checks on how well the RL estimates hold up.

This paper is for researchers working on reliable interactive LLMs or on generalization diagnostics in applied settings. A reader who wants concrete metrics for character AI reliability would find the shift ranking and the RL comparison useful. I would send it to peer review because the core idea is coherent and targets a real deployment problem, even if the theoretical details require checking.

Referee Report

2 major / 2 minor

Summary. The paper introduces an information-theoretic metric called reasoning-based effective mutual information difference (R-EMID) to quantify performance degradation in role-playing models (RPMs) due to distribution shifts in user, character, and dialogue contexts. It derives an upper bound on R-EMID to predict worst-case generalization performance and reveals contributions of various shifts. A co-evolving reinforcement learning framework is proposed to model connections among user, character, and dialogue to improve response generation probability estimates for R-EMID calculation. Empirical evaluations show that user shift poses the highest risk and that reinforcement learning is the most effective approach for enhancing RPM generalization.

Significance. If the derivations and empirical pipeline hold, this provides a formal, interpretable alternative to LLM-as-a-judge for diagnosing RPM generalization failures under specific shifts. The upper bound could enable worst-case predictions without exhaustive testing, and the co-evolving RL component directly targets the probability estimation bottleneck in the metric. The ranking of shift risks offers actionable guidance for deployment and training priorities in role-playing applications.

major comments (2)

The upper bound on R-EMID is claimed to predict worst-case performance and reveal how shifts contribute to degradation, but it is unclear whether the bound is independent of quantities optimized during the co-evolving RL training or reduces to fitted terms by construction (see abstract description of the bound and RL framework). This affects whether the theoretical revelation is non-circular.
The co-evolving RL framework is presented as accurately modeling connections among user, character, and dialogue context to enhance response probability estimates for R-EMID, but no derivation or validation is supplied showing that the joint modeling improves the mutual information difference calculation beyond standard RL baselines.

minor comments (2)

Abstract states the metric, upper bound, and empirical findings but supplies no derivation steps, error analysis, or dataset details, which hinders immediate assessment of soundness.
Empirical section should report concrete R-EMID values, confidence intervals, and ablation results for the user/character/dialogue shifts to support the claim that user shift poses the highest risk.
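
The confidence intervals requested in the second minor comment could be obtained with a percentile bootstrap over per-sample scores. The sketch below is one common way to do this, not a procedure taken from the paper; the function name `bootstrap_ci`, the resample count, and the example scores are assumptions.

```python
import random
import statistics

def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical per-dialogue metric values under a user shift.
scores = [0.42, 0.55, 0.38, 0.61, 0.47, 0.52, 0.44, 0.58, 0.49, 0.40]
low, high = bootstrap_ci(scores)
print(f"mean={statistics.fmean(scores):.3f}  95% CI=({low:.3f}, {high:.3f})")
```

Reporting such an interval for each shift type would show whether the gap between user shift and the other shifts exceeds sampling noise.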

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the insightful comments, which help strengthen the clarity of our theoretical and empirical contributions. We address each major point below, emphasizing the separation between the information-theoretic bound and the estimation procedure.

Referee: The upper bound on R-EMID is claimed to predict worst-case performance and reveal how shifts contribute to degradation, but it is unclear whether the bound is independent of quantities optimized during the co-evolving RL training or reduces to fitted terms by construction (see abstract description of the bound and RL framework). This affects whether the theoretical revelation is non-circular.

Authors: The upper bound on R-EMID is obtained from standard information-theoretic inequalities (e.g., properties of KL divergence and mutual information) that hold for arbitrary probability distributions and do not reference any training objective or optimized parameters. The co-evolving RL procedure is introduced exclusively to produce more accurate estimates of the response-generation probabilities needed for the empirical computation of R-EMID; these estimates are plugged into the already-derived bound after training. Consequently, the bound itself remains independent of the RL optimization and the revelation of shift contributions is non-circular. We will add an explicit paragraph in the theoretical section of the revision to restate this separation.

Revision: partial

Referee: The co-evolving RL framework is presented as accurately modeling connections among user, character, and dialogue context to enhance response probability estimates for R-EMID, but no derivation or validation is supplied showing that the joint modeling improves the mutual information difference calculation beyond standard RL baselines.

Authors: The co-evolving RL procedure alternately optimizes its two modules, the reasoning generator and the conditional probability estimator (Figure 3), so that the resulting estimates reflect the statistical dependencies among user, character, and dialogue context, which standard RL baselines do not explicitly model. While a closed-form proof that this alternation strictly improves the mutual-information estimator is not supplied in the current manuscript, the experimental section reports consistent gains in both R-EMID reduction and downstream generalization metrics relative to standard RL and supervised fine-tuning baselines. We will include an additional subsection providing a concise information-theoretic motivation for the joint update rule together with further ablation results that isolate the benefit of co-evolution.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper defines R-EMID directly from information-theoretic quantities (mutual information differences under shifts) and derives its upper bound via standard bounding techniques on those quantities. The co-evolving RL component is introduced explicitly to improve the response probability estimates required for computing R-EMID, without any reduction of the bound or metric back to the RL parameters by construction. No self-citation is load-bearing for the central claims, no ansatz is smuggled, and no renaming of known results occurs. The empirical ranking of shifts and RL effectiveness follows from direct application of the independent metric and bound.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are identifiable. R-EMID appears to rest on standard information-theoretic definitions of mutual information and conditional probabilities.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages

[1]

Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, and Tushar Khot

An information-theoretic approach to distribution shifts. Advances in Neural Information Processing Systems, 34:17628–17641. Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, and Tushar Khot. 2023. Specializing smaller language models towards multi-step reasoning. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 1042...

[2]

Najoung Kim and Tal Linzen

An information-theoretic evaluation of generative models in learning multi-modal distributions. In Thirty-seventh Conference on Neural Information Processing Systems. Daniel Keysers, Nathanael Schärli, Nathan Scales, Hylke Buisman, Daniel Furrer, Sergii Kashubin, Nikola Momchev, Danila Sinopalnikov, Lukasz Stafiniak, Tibor Tihon, et al. 2019. Measuring c...

[3]

Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein

Large language models are superpositions of all characters: Attaining arbitrary role-play via self-alignment. arXiv preprint arXiv:2401.12474. David JC MacKay. 2003. Information theory, inference and learning algorithms. Cambridge university press. M. L. Menéndez, J. A. Pardo, L. Pardo, and M. C. Pardo

[4]

334(2):307–318

The Jensen-Shannon divergence. 334(2):307–318. Changdae Oh, Zhen Fang, Shawn Im, Xuefeng Du, and Yixuan Li. 2025. Understanding multimodal llms under distribution shifts: An information-theoretic approach. In International Conference on Machine Learning. OpenAI. 2024. Gpt-4 technical report. Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringe...

[5]

Character-llm: A trainable agent for role-playing. arXiv preprint arXiv:2310.10158, 2023

Character-llm: A trainable agent for role-playing. arXiv preprint arXiv:2310.10158. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300. Ryan Shea and Zhou Yu. 202...

[6]

Lei Wang, Jianxun Lian, Yi Huang, Yanqi Dai, Haoxuan Li, Xu Chen, Xing Xie, and Ji-Rong Wen

Charactereval: A chinese benchmark for role-playing conversational agent evaluation. arXiv preprint arXiv:2401.01275. Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, and Shengyi Huang. 2020. Trl: Transformer reinforcement learning. https://github.com/huggingface/trl. Noah Wang, Zy Peng, Haoran Que, J...

[7]

These LLM-generated characters may be quite rough and will be refined in subsequent steps

Initial character collection: We use the seed persona information to construct prompts that guide an LLM, e.g., claude3.7-sonnet, qwen3-235b-a22b, to generate characters from the domains of literature, film, theater, and gaming. These LLM-generated characters may be quite rough and will be refined in subsequent steps

[8]

LLM with search engine

Web-based character information retrieval and refinement: Using the “LLM with search engine” functionality provided by the Qwen3 API platform, we retrieve online information related to each character generated in the previous step. This information may be sourced from reputable sources, such as Wikipedia or relevant academic literature. Based on the possib...

[9]

The three prompt templates used for LLM-based text generation in the above steps are presented in Table 3

Character type classification: Additionally, to facilitate the construction of character shifts in the subsequent step, we also employ LLMs to determine whether each character is real-world or fictional. The three prompt templates used for LLM-based text generation in the above steps are presented in Table 3. Step 2: User Persona Collection. In Step 1, we ob...

[10]

Dialogue shift Figure 8: Overview of the RPGBench construction. First, we gather a diverse collection of agent characters spanning multiple domains (Step 1), user personas with diverse linguistic backgrounds (Step 2), and a variety of dialogues (Step 3). In Step 4, for the collected samples, we construct distribution shifts based on: 1) user shift from Eng...

[11]

ID dialogues

of RPMs, we further construct composed dialogues via turn-level and word-level compositions. These composed samples reflect real-world challenges of dialogue systems such as long-context understanding (Bai et al., 2023) and nonlinear conversations. For the turn-level shift, we prompt the LLM to combine two given 4-turn ID dialogues into a single 8-turn ...

[12]

**Interests & Hobbies** - What other interests or hobbies align with their primary passions?

[13]

**Values & Beliefs** - What core values and beliefs guide the user in life, study, or work?

[14]

**Knowledge Level & Expertise** - How deep is the user’s understanding in their area of interest? Do they have any specialized knowledge or skills?

[15]

**Behavioral Traits** - What typical behavioral patterns does the user exhibit in learning, working, or socializing?

[16]

**Communication Style** - How does the user typically communicate? Formally, casually, analytically, emotionally?

[17]

‘json {{

**Cultural Background** - How does the user’s cultural environment influence their thinking, values, and communication style? The original user persona: {step1_ReversedUserPersona} Make sure the expanded user persona is in the same language as the original. For example, if the original user persona is described in German/Spanish/Chinese/Japanese/Korean,...

[18]

{user_persona}

It causes the **maximum change in the expected response** with the **least change to the input** (i.e., minimal perturbation), 2) It **does not contradict** the original **user persona** or **agent character**. Given the following information: - **User Persona**: "{user_persona}" - **Agent Character Profile**: ”’ {agent_character} ”’ - **Dialogue Context*...

[19]

users are likely to ask queries related to their own user information

The maximum entropy term is modified to account for the reasoning process variable $r$, yielding $\hat{H} = \max_{x \in \mathcal{X}} \left[ H(Q_{Y|x,r}) + H(P_{Y_\theta|x,r}) \right]$; 2) We further analyze and refine the term $\sqrt{2 D_{\mathrm{JS}}(P_X \,\|\, Q_X)}$ to better align with the input components in role-playing settings. First, we begin with a symmetric decomposition of the joint KL divergence by applying the chain r...

[20]

rollout of the reasoning generator

the GRPO loss; 2) prompt templates for rollout sampling and reward computation in reinforcement learning; 3) details on the collection of reasoning processes used for SFT initialization of the two modules; and 4) other implementation details, such as training hyperparameters. Group Relative Policy Optimization (GRPO). Following the standard Group Relative P...

[21]

{user_persona}

as the RL algorithm. The RL training is initialized from the SFT checkpoint. The reward computation follows the same approach as in RFT. The prompt template used for rollouts is the same as in SFT, as shown in Table 17. The number of rollouts is set to 8. • SFT with Thinking modes (ThinkingSFT) (Tang et al., 2025): The key difference between ThinkingSFT an...

[22]

Your responses should be closely aligned with both the user persona and the agent’s character, as different users and agent profiles can lead to very different conversations

[23]

Provide **three distinct possibilities**, each of which must adhere to the requirements in point 1 and 2

[24]

‘json {{

Output the result **strictly in the JSON format** shown below. ### Output Format: “‘json {{ "possibility_1": "", "possibility_2": "", "possibility_3": "" }} “‘ Prompt Template (step 2) You are tasked with generating a realistic 4-round dialogue between a user and an agent. Given the following information: - **User Persona**: "{user_persona}" - **Chat Hist...

