Ψ-Bench: Evaluating Persona-Sensitive Influencing in Persuasive Dialogues

Hongyi Du; Jiaxuan You; Jiayu Liu; Peixuan Han; Yihang Sun; Yutong Liu

arxiv: 2606.02754 · v1 · pith:WFKDEBLPnew · submitted 2026-06-01 · 💻 cs.LG


Pith reviewed 2026-06-28 15:10 UTC · model grok-4.3


keywords: Ψ-Bench · persuasive dialogues · persona-sensitive influencing · LLM evaluation · personalization · proactive agents · persuasion benchmark


The pith

Access to client profiles improves LLM persuasion performance by an average of 18.24% across three realistic scenarios.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Ψ-Bench to measure how effectively large language models can influence users in proactive, persuasive conversations by incorporating personal details. It tests ten frontier models in three everyday scenarios and observes that models generate coherent arguments yet still fall short on actual persuasion success. A central result is that supplying explicit user profiles drawn from dialogue histories raises average performance by 18.24 percent. This points to the value of persona information when agents must guide rather than merely reply to users.

Core claim

Ψ-Bench evaluates LLMs on persona-sensitive influencing in persuasive dialogues and shows that while most models produce coherent and reasonable arguments, even state-of-the-art systems leave considerable room for improvement. Providing access to client profiles derived from dialogue histories yields an average performance gain of 18.24 percent, underscoring the importance of user-specific information for effective persuasion and highlighting persona-sensitive influencing as a practical direction for more proactive personalized agents.
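
How an "average performance gain" of this kind might be computed is sketched below. The source does not say whether 18.24 percent is a relative or an absolute gain, nor how scores are aggregated across models and scenarios, so the sketch reports both conventions. The model names and scores are made-up placeholders, not results from the paper.

```python
from statistics import mean

def average_gains(scores):
    """scores maps model name -> (score_without_profile, score_with_profile).

    Returns (mean relative gain in percent, mean absolute gain in score points).
    Which of the two the paper reports is not stated in the source.
    """
    relative = [100.0 * (w - wo) / wo for wo, w in scores.values()]
    absolute = [w - wo for wo, w in scores.values()]
    return mean(relative), mean(absolute)

# Hypothetical numbers for illustration only.
example = {"model_a": (4.0, 5.0), "model_b": (5.0, 5.5)}
rel, abs_ = average_gains(example)
print(f"relative: {rel:.2f}%  absolute: {abs_:.2f}")
```

The two conventions can differ substantially when baseline scores are low, which is why the exact definition matters when comparing against the headline figure.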

What carries the argument

Ψ-Bench, a benchmark with three real-world persuasion scenarios that endows simulated clients with personal characteristics via explicit user profiles extracted from dialogue histories.

If this is right

Frontier LLMs still have substantial room for improvement in effective persuasion.
User-specific profile information is important for raising persuasion success rates.
Proactive personalization through conversation is a practical evaluation target beyond passive response.
Benchmark results can guide development of agents that guide users rather than only react to them.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benchmark could be extended to additional domains or longer multi-turn interactions to test generalization.
Performance differences might inform fine-tuning objectives that reward profile-aware argument selection.
Real-world deployments in advisory or sales settings could measure similar profile-driven gains with live users.

Load-bearing premise

The simulated clients endowed with personal characteristics through explicit user profiles derived from dialogue histories accurately represent realistic users whose behavior can be influenced in the three designed scenarios.

What would settle it

Re-running the evaluations with real human participants in place of the simulated clients and checking whether the 18.24 percent average gain from profile access still appears.

Figures

Figures reproduced from arXiv: 2606.02754 by Hongyi Du, Jiaxuan You, Jiayu Liu, Peixuan Han, Yihang Sun, Yutong Liu.

**Figure 1.** Ψ-Bench and prior benchmarks. Despite the importance of proactive personalization, most work on LLM-based persuasion evaluates LLM agents’ generic influencing ability without grounding the target user in individualized profiles (Singh et al., 2024; Han et al., 2025), failing to capture the personalized nature of real-world persuasion. In addition, relying on generic, nonpersonalized judges may cause e…

**Figure 2.** Overview of Ψ-Bench. We collect queries from 3 scenarios, curate realistic personas paired with each query, and utilize personalized clients and an expert judge to evaluate LLMs’ persona-sensitive influencing. Everyday Request: In this task, the evaluated model is expected to persuade the client to take a helpful action in response to a daily-life request. This task challenges the model’s abilities in soci…

**Figure 3.** LLMs’ performance trends on Ψ-Bench Debate scenario in 6 turns. … the judge model is provided with the first k turns of the dialogue and is asked to assign a score based on the partial conversation observed up to that point.

**Figure 4.** Comparison of LLMs’ performance on Ψ-Bench Debate scenario with and without the client profile. The “Oracle” setting, where the client’s full profile is accessible for the tested LLMs, exhibits significantly stronger persuasion outcomes. … compared with more diverse conversations. Section 4.4 (Case Study): This section presents a qualitative analysis of LLMs’ persuasion patterns. Through the cases (Figures 20 to 22), w…

**Figure 5.** Distribution of client profiles in Ψ-Bench.

**Figure 6.** Python-style code for the human study. [Plot residue from the same page: panels Quality vs. Effect (r = 0.752), Personalize vs. Effect (r = 0.772), and Match vs. Effect (r = 0.409).]

**Figure 7.** Correlation between intermediate metrics (…

**Figure 8.** The webpage for human annotating.

**Figure 9.** Prompt for profile construction for Viewpoint Debate.

**Figure 10.** Prompt for profile construction for Psychological Consultation.

**Figure 11.** Prompt for profile construction for Everyday Request.

**Figure 12.** Prompt for the client and persuader in Viewpoint Debate.

**Figure 13.** Prompt for the client and persuader in Psychological Consultation.

**Figure 14.** Prompt for the client and persuader in Everyday Request.

**Figure 15.** Information about the client’s profile. Appended to the persuader’s prompt in the Oracle setting or with…

**Figure 16.** Prompt for the profile analyzer to predict the client’s profile.

**Figure 17.** Prompt for the judge model in Viewpoint Debate.

**Figure 18.** Prompt for the judge model in Psychological Consultation.

**Figure 19.** Prompt for the judge model in Everyday Request.

**Figure 20.** A successful case of GPT-5.1’s conversation in…

**Figure 21.** A successful case of GPT-5.1’s conversation in…

**Figure 22.** A failed case of Qwen3-32B’s conversation in…
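
The per-turn evaluation mentioned in the Figure 3 caption, where the judge sees only the first k turns and scores that partial conversation, can be sketched as follows. This is a minimal illustration, not the authors' code: `judge` is a hypothetical callable standing in for the judge model, and treating one turn as a (persuader message, client reply) pair is an assumption.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (persuader message, client reply); an assumed layout

def score_prefixes(dialogue: List[Turn],
                   judge: Callable[[List[Turn]], float]) -> List[float]:
    """Score the first k turns for k = 1..len(dialogue).

    `judge` is a placeholder for a call to the judge model.
    """
    return [judge(dialogue[:k]) for k in range(1, len(dialogue) + 1)]

# Toy judge for demonstration: counts client replies containing "agree".
def toy_judge(prefix: List[Turn]) -> float:
    return float(sum("agree" in client.lower() for _, client in prefix))

dialogue = [("Point 1", "I doubt it."),
            ("Point 2", "Hmm, I partly agree."),
            ("Point 3", "Fine, I agree.")]
trend = score_prefixes(dialogue, toy_judge)
print(trend)
```

Plotting the returned list against k would give a trend curve of the kind the caption describes, one curve per evaluated model.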

Original abstract

Personalization is a crucial capability of modern language agents. However, current research primarily positions personalized agents as passive responders to user preferences, limiting their ability to interact with users and provide suggestions or guidance proactively. To systematically evaluate such proactive personalization in realistic interactions, we propose Ψ-Bench, a benchmark for assessing LLMs' ability to influence realistic users through conversation. We design three real-world interaction scenarios that involve persuasion in Ψ-Bench, and endow simulated clients with personal characteristics through explicit user profiles derived from dialogue histories. We evaluate 10 frontier LLMs on Ψ-Bench and find that while most models can produce coherent and reasonable arguments, even state-of-the-art models still leave considerable room for improvement in persuasion. We also find that providing access to client profiles yields an average performance gain of 18.24%, highlighting the importance of user-specific information for effective persuasion. Overall, our work highlights persona-sensitive influencing as a challenging yet practical direction for evaluating and developing more proactive personalized LLM agents. Code is available at: https://github.com/Hanpx20/Psi-Bench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Ψ-Bench creates a new evaluation setup for proactive persuasion with user profiles and reports an 18.24% average gain from profile access, but the simulated clients are not checked against real behavior.


The paper's core offering is Ψ-Bench, a benchmark with three persuasion scenarios in which simulated clients carry explicit profiles derived from dialogue histories. The authors run ten frontier models and report that giving the models access to those profiles lifts average performance by 18.24 percent. They also release the code.

What stands out is the direct comparison between profile and no-profile conditions. That isolates the effect of persona information in a proactive setting, which moves beyond the usual passive personalization work. The scenarios are concrete, the evaluation covers multiple models, and the numbers are presented plainly.

The soft spot is the simulation itself. The clients are constructed from histories, but the paper does not test whether their responses to persuasion match how real people would behave in the same situations. If the simulated users are more compliant or leak profile details more readily than actual users, the measured gain does not necessarily translate outside the benchmark. Metrics and scenario construction details would also benefit from closer inspection, though the central empirical contrast is straightforward.
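
One cheap way to probe the leakage concern is to check whether a simulated client's messages repeat its own profile entries verbatim. The sketch below is an assumption-laden illustration: the profile layout (a flat dict of string fields), the field names, and the substring test are all hypothetical, and the check is much cruder than a real fidelity study.

```python
from typing import Dict, List

def leaked_fields(profile: Dict[str, str], client_messages: List[str]) -> List[str]:
    """Return profile field names whose value appears verbatim
    (case-insensitive) in any client message."""
    text = " ".join(client_messages).lower()
    return [name for name, value in profile.items()
            if value and value.lower() in text]

# Hypothetical profile and messages.
profile = {"occupation": "high school teacher", "style": "sarcastic"}
messages = ["As a high school teacher, I hear this all day.",
            "Sure, that will definitely work."]
print(leaked_fields(profile, messages))
```

A high leak rate in the no-profile condition would suggest the persuader can recover the profile from the conversation alone, which would change how the measured gain should be read.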

The work is aimed at researchers building or evaluating LLM agents for applied persuasion tasks such as sales, coaching, or advice. Anyone already running benchmarks on interactive agents will find the setup and the profile-ablation result usable.

I would send it to peer review. The benchmark is new, the experiments are reproducible with the released code, and the main claim is falsifiable even if the interpretation of the gain needs more grounding.

Referee Report

2 major / 1 minor

Summary. The paper introduces Ψ-Bench, a benchmark consisting of three real-world persuasive dialogue scenarios designed to evaluate LLMs' proactive personalization capabilities. Simulated clients are endowed with explicit user profiles derived from dialogue histories; 10 frontier LLMs are evaluated on their ability to influence these clients, with results showing coherent arguments but substantial room for improvement in persuasion success, plus an average 18.24% performance gain when client profiles are provided.

Significance. If the simulated clients faithfully capture realistic user response patterns, the benchmark would offer a practical tool for measuring and advancing persona-sensitive influencing, and the reported gain would provide concrete evidence that access to user-specific information materially improves persuasion outcomes in proactive settings.

major comments (2)

[Benchmark construction and evaluation setup (abstract and § describing client simulation)] The central empirical claim—that providing client profiles yields an 18.24% average gain and thereby highlights the importance of user-specific information for effective persuasion—rests on the unvalidated premise that the simulated clients (whose traits are derived from dialogue histories) exhibit influenceability and response patterns matching real users in the three scenarios. No human-subject validation or fidelity checks against real interactions are described, making it impossible to rule out simulation artifacts (e.g., profile leakage or artificial compliance) as the source of the measured gain.
[Results and experimental protocol] The paper reports performance metrics and the 18.24% figure but provides no details on the exact success metric, statistical significance testing, variance across runs or scenarios, or how the three scenarios were constructed to ensure they test persona-sensitive influencing rather than generic persuasion. These omissions prevent assessment of whether the quantitative results support the broader claims.
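
As one example of the statistical detail the second comment asks for, a paired bootstrap over test items gives a confidence interval for the profile gain. This is a generic technique, not something the paper is known to use; the per-item scores below are invented, and pairing by item assumes each item was run in both conditions.

```python
import random
from statistics import mean
from typing import List, Tuple

def paired_bootstrap_ci(without: List[float], with_profile: List[float],
                        n_resamples: int = 2000, alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float, float]:
    """Percentile bootstrap CI for the mean per-item difference.

    Returns (observed mean difference, lower bound, upper bound).
    """
    if len(without) != len(with_profile) or not without:
        raise ValueError("need two equal-length, non-empty score lists")
    diffs = [w - wo for wo, w in zip(without, with_profile)]
    rng = random.Random(seed)
    n = len(diffs)
    # Resample the per-item differences with replacement and record each mean.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return mean(diffs), lo, hi

# Invented per-item judge scores for the two conditions.
no_profile = [4, 5, 3, 6, 5, 4, 5, 6]
oracle = [5, 6, 4, 6, 7, 5, 5, 7]
obs, lo, hi = paired_bootstrap_ci(no_profile, oracle)
print(f"gain = {obs:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

An interval that excludes zero would support the claim that the gain is not a sampling accident, although it says nothing about whether the simulated clients behave like real users.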

minor comments (1)

[Abstract] The abstract states that 'most models can produce coherent and reasonable arguments' yet still 'leave considerable room for improvement'; a concrete breakdown of failure modes (e.g., by scenario or model) would strengthen the presentation.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on Ψ-Bench. We address each major comment below and indicate where revisions will be incorporated.


Referee: [Benchmark construction and evaluation setup] The central empirical claim, that providing client profiles yields an 18.24% average gain, rests on the unvalidated premise that the simulated clients exhibit influenceability and response patterns matching real users. No human-subject validation or fidelity checks against real interactions are described, making it impossible to rule out simulation artifacts.

Authors: We acknowledge that the benchmark does not include human-subject validation of the simulated clients' response fidelity. Client profiles are derived directly from real dialogue histories to capture persona traits, and scenarios are drawn from realistic persuasion contexts; the reported gain is measured within this controlled simulation. We will add an expanded limitations section explicitly discussing potential simulation artifacts and the value of future human validation studies.

revision: yes

Referee: [Results and experimental protocol] The paper reports performance metrics and the 18.24% figure but provides no details on the exact success metric, statistical significance testing, variance across runs or scenarios, or how the three scenarios were constructed to ensure they test persona-sensitive influencing rather than generic persuasion.

Authors: We will revise the manuscript to provide explicit definitions of the success metric, any statistical testing performed, observed variance across runs and scenarios, and the rationale for scenario construction with emphasis on persona-sensitive elements.

revision: yes

standing simulated objections not resolved

Absence of human-subject validation or fidelity checks confirming that simulated clients match real-user response patterns in the three scenarios.

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with direct measurements


The paper introduces Ψ-Bench as an empirical evaluation framework for LLM persuasion in three scenarios, endowing simulated clients with profiles derived from dialogue histories and reporting measured performance differences (e.g., 18.24% average gain when profiles are provided). No equations, fitted parameters, predictions, or derivations are present. No self-citations are invoked to justify uniqueness or load-bearing premises. The simulation fidelity assumption is explicitly stated as an untested premise rather than derived or self-defined. Results are direct model-run measurements, making the work self-contained against external benchmarks with no reduction of claims to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the work is an empirical benchmark proposal relying on standard LLM evaluation practices.


Reference graph

Works this paper leans on

25 extracted references · 3 canonical work pages · 2 internal anchors

[1]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, and 1 others. 2024. OpenAI o1 system card. arXiv preprint arXiv:2412.16720. Bowen Jiang, Yuan Yuan, Maohao Shen, Zhuoqun Hao, Zhangchen Xu, Zichen Chen, Ziyi Liu, Anvesh Rao Vijjini, Jiashu He, Hanchao Yu, a...

[2]

Rethinking Prospect Theory for LLMs: Revealing the Instability of Decision-Making under Epistemic Uncertainty

Persuasion dynamics in llms: Investigating robustness and adaptability in knowledge and safety with duet-pd. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 1550–1575. Zhaoxuan Tan, Qingkai Zeng, Yijun Tian, Zheyuan Liu, Bing Yin, and Meng Jiang. 2024. Democratizing large language models via personalized p...

[3]

Personality traits

Language agents with reinforcement learning for strategic play in the werewolf game. arXiv preprint arXiv:2310.18940. Naifan Zhang, Ruihan Sun, Ruixi Su, Shiqi Ma, Shiya Zhang, Xianna Weng, Xiaofan Zhang, Yuhan Zhan, Yuyang Xu, Zhaohan Chen, and 1 others. 2025a. Echo-n1: Affective rl frontier. arXiv preprint arXiv:2512.00344. Tianyi Zhang, Varsha Kishore,...

[7]

ChangeMyView

Your answer should be a one-passage response. **The maximum length is 200 tokens**. ## Task Description You posted a message on an online forum, which reflects your opinions on a specific topic. The content of your original post is: ”’ {question} ”’ Now, please respond to a commenter on your post. Persuader: You are a knowledgable debate expert. You came ...
[10]

**The maximum length is 200 tokens**

Your answer should be a one-passage response. **The maximum length is 200 tokens**. ## Response Format Your answer should contain two parts: think and argument. Format your answer like this: <think> Briefly recall the previous conversation and state your strategy in the next turn. </think> <argument> Your argument. </argument> DO NOT include thinking proc...
[14]

**The maximum length is 200 tokens**

Your answer should be a one-passage response. **The maximum length is 200 tokens**. ## Task Description Due to prolonged psychological dilemma, you have become desperate and lost hope for the future. You consulted a professional psychological therapist and explained to him the predicament you were facing: ”’ {question} ”’ Now, please respond to the therap...
[17]

**The maximum length is 200 tokens**

Your answer should be a one-passage response. **The maximum length is 200 tokens**. ## Response Format Your answer should contain two parts: think and argument. Format your answer like this: <think> Briefly recall the previous conversation and state your strategy in the next turn. </think> <argument> Your argument. </argument> DO NOT include thinking proc...
[18]

You are NOT an AI

Role-play as the person described. You are NOT an AI. Maintain a consistent personality throughout the chat
[19]

Reflect the user’s information, like background, speaking style, knowledge level, consistently in your responses
[20]

Read through previous messages and ensure that your response is logically coherent with your original post and previous responses
[21]

**The maximum length is 200 tokens**

Your answer should be a one-passage response. **The maximum length is 200 tokens**. ## Task Description You have been very busy with work at the company recently. A close friend you haven’t been in touch with for a long time asks you to help them with a personal matter. Please respond to them. Persuader: You are a person chatting with a friend. ## Task De...
[22]

Read through previous messages and ensure that your response is logically coherent

The conversation may span several turns. Read through previous messages and ensure that your response is logically coherent
[23]

DO NOT include uncertified evidences or unverified information
[24]

or", "possibly

Your answer should be a one-passage response. **The maximum length is 200 tokens**. ## Response Format Your answer should contain two parts: think and argument. Format your answer like this: <think> Briefly recall the previous conversation and state your strategy in the next turn. </think> <argument> Your argument. </argument> DO NOT include thinking proc...
[25]

- 5: User B’s arguments are sensible and well-structured, with only minor issues that do not significantly detract from the overall quality

**General Conversation Quality:** Were User B’s contributions well-organized and did they present a logical case for their viewpoint? Reference Criteria: - 1: User B’s argument has rooms for improvement, with potential logical gaps or inconsistencies. - 5: User B’s arguments are sensible and well-structured, with only minor issues that do not significantl...
[26]

Focus on the uniqueness of the user

**Personalized Argumentation:** Did User B employ debate strategies or phrasing specifically tailored to User A? **Do not** consider universal debate standards (e.g., analytical reasoning, logical clarity), even if they’re listed in the user’s profile. Focus on the uniqueness of the user. Reference Criteria: - 1: User B’s arguments are generic, showing li...
[27]

general_conversation_quality

**Persuasion Effect:** Did User B’s arguments lead User A to change their attitude towards User B’s opinion in the end? This should be evaluated by analysing User A’s responses throughout the conversation. Reference Criteria: - 1: User B has limited influence on User A’s attitude. User A’s attitude towards User B’s opinion remains unchanged or even more n...
[28]

Reference Criteria: - 1: User B don’t provide helpful and concrete responses to User A’s dilemmas

**General Conversation Quality:** Does User B provide professional responses to User A’s dilemmas? Standards for a professional counselor include: empathetic, specific, helpful and non-toxic. Reference Criteria: - 1: User B don’t provide helpful and concrete responses to User A’s dilemmas. - 5: User B’s words are professional and well-structured, meeting ...
[29]

Focus on the implicit features

**Personalized Response:** Did User B employ advices and phrasings specifically tailored to User A? **Do not** consider traits explicitly mentioned in User A’s words, like stressed or sensitive, even if they’re listed in the user’s profile. Focus on the implicit features. Reference Criteria: - 1: User B’s words are generic, showing little or no adaptation...
[30]

general_conversation_quality

**Treatment Effect**: Did User B’s responses alleviate the user’s psychological issues, enabling them to become more positive? This should be evaluated by analysing User A’s responses throughout the conversation. Reference Criteria: - 1: User B helped User A to improve their psychological state. User A finds User B’s advice reasonable and decided to try o...
[31]

- 5: User B communicates the request with good clarity and politeness, but may lack minor details, or full coherence

**General Conversation Quality:** Does User B make the request in a polite, clear, and reasonable manner? Reference Criteria: - 1: User B doesn’t provide sufficient information about the request, or the request is made in an impolite or unclear way. - 5: User B communicates the request with good clarity and politeness, but may lack minor details, or full ...
[32]

- 5: User B’s words and strategies show adaptation to User A’s profile that is not comprehensive, covering some entries in A’s profile

**Personalized Response:** Did User B employ request methods or phrasing specifically tailored to User A? Reference Criteria: - 1: User B’s words are generic, showing little or no adaptation to User A’s information. - 5: User B’s words and strategies show adaptation to User A’s profile that is not comprehensive, covering some entries in A’s profile. - 9: ...
[33]

general_conversation_quality

**Request Effect**: Did User B’s responses convince User A to accept the request? This should be evaluated by analysing User A’s responses throughout the conversation. Reference Criteria: - 1: User A understands user B’s situation, but they are not very willing to accept the request, or claiming they need more time to consider the request. - 5: User A par...
[34]

A model that can’t be wrong isn’t really helping you understand reality; it’s functioning more like a comforting story

It explains too much, too easily. A model that can’t be wrong isn’t really helping you understand reality; it’s functioning more like a comforting story
[35]

you” (in some higher sense) chose every challenge, then extreme trauma, genocide, or a child dying of illness become “interesting plot twists

It risks trivializing suffering. If “you” (in some higher sense) chose every challenge, then extreme trauma, genocide, or a child dying of illness become “interesting plot twists” selected for growth. That might help an individual cope, but as a description of reality it erodes moral urgency: why fight injustice if it’s all self-authored entertainment? ...
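
The persuader prompts quoted in the entries above ask for a reply in two parts, a `<think>` block followed by an `<argument>` block. A minimal parser for that format might look like the sketch below. The function name and the choice to return `None` for a missing part are assumptions, not taken from the released code.

```python
import re
from typing import Optional, Tuple

_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)
_ARGUMENT = re.compile(r"<argument>(.*?)</argument>", re.DOTALL)

def parse_reply(reply: str) -> Tuple[Optional[str], Optional[str]]:
    """Split a model reply into (think, argument); None if a part is missing."""
    think = _THINK.search(reply)
    argument = _ARGUMENT.search(reply)
    return (think.group(1).strip() if think else None,
            argument.group(1).strip() if argument else None)

# Made-up reply in the quoted two-part format.
reply = ("<think> They value evidence; lead with data. </think>\n"
         "<argument> Consider the cost of doing nothing. </argument>")
think, argument = parse_reply(reply)
print(argument)
```

Only the `argument` part would be passed on to the client, in line with the prompts' instruction not to include the thinking process in the argument.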
