FURINA: A Fully Customizable Role-Playing Benchmark via Scalable Multi-Agent Collaboration Pipeline

Chengwei Qin; Chios Chen; Haotian Wu; Hehai Lin; Heqing Zou; Shufan Jiang; Yao Shu; Yiyang Feng

arxiv: 2510.06800 · v3 · submitted 2025-10-08 · 💻 cs.CL · cs.AI · cs.HC · cs.MA

Pith reviewed 2026-05-18 09:37 UTC · model grok-4.3

keywords: role-playing benchmarks · multi-agent systems · LLM evaluation · hallucinations · performance trade-offs · customizable benchmarks · dialogue simulation

The pith

A multi-agent pipeline automatically builds customizable role-playing benchmarks that expose a performance-reliability trade-off in LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FURINA-Builder as a scalable pipeline that uses multiple agents to simulate dialogues between a test character and others from a character-scene pool. An LLM judge then selects fine-grained evaluation dimensions and converts responses into test utterances, enabling benchmarks for arbitrary characters and scenarios. When applied to create FURINA-Bench, the method shows established characters outperform synthesized ones and that reasoning models gain role-play accuracy yet produce more hallucinations. This pattern points to a broader Pareto frontier where gains in performance come with losses in reliability across all models tested. The findings highlight both the adaptability of the new construction method and persistent challenges in evaluating interactive language model behavior.

Core claim

FURINA-Builder constructs fully customizable role-playing benchmarks at any scale by simulating dialogues and letting an LLM judge define dimension-specific criteria and test utterances. Applied to both established and synthesized characters, the resulting benchmark shows o3 leading on English tasks and DeepSeek-R1 on Chinese tasks, with established characters outperforming synthesized ones and reasoning capabilities widening that gap. Across models the work identifies that reasoning improves role-play performance while simultaneously increasing hallucinations, extending to a Pareto frontier between performance and reliability.

What carries the argument

FURINA-Builder, the multi-agent collaboration pipeline that draws characters and scenes from a pool, simulates dialogues, and relies on an LLM judge to select evaluation dimensions and generate final test utterances.
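
The three stages can be pictured with a minimal Python sketch. It is not the authors' implementation: `Scene`, `TestUtterance`, `build_benchmark`, `stub_respond`, and `stub_judge` are hypothetical names, the models are replaced by deterministic stubs, and the step in which responses from a source model and a base model are compared is collapsed into a single `respond` call. The dimension codes (CR, FR, RR, CA, PA) are the ones named in the paper's annotation guidelines.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Dimension codes named in the paper's annotation guidelines.
DIMENSIONS = ["CR", "FR", "RR", "CA", "PA"]


@dataclass
class Scene:  # hypothetical container for one entry of the character-scene pool
    description: str
    scene_characters: List[str]


@dataclass
class TestUtterance:  # hypothetical container for one benchmark item
    scene: str
    history: List[str]
    dimension: str
    utterance: str


def build_benchmark(
    test_character: str,
    pool: List[Scene],
    respond: Callable[[str, Scene, List[str]], str],
    judge: Callable[[List[str], str], Tuple[str, str]],
    turns: int = 3,
    seed: int = 0,
) -> List[TestUtterance]:
    """Sample a scene, simulate a dialogue, and let the judge finalize each test turn."""
    rng = random.Random(seed)
    scene = rng.choice(pool)  # (i) draw a scene from the pool
    history: List[str] = []
    items: List[TestUtterance] = []
    for turn in range(turns):
        # (ii) simulation: a scene character speaks, then the test character answers
        speaker = scene.scene_characters[turn % len(scene.scene_characters)]
        history.append(f"{speaker}: {respond(speaker, scene, history)}")
        candidate = respond(test_character, scene, history)
        # (iii) selection: the judge picks a dimension and the final utterance
        dimension, utterance = judge(history, candidate)
        items.append(TestUtterance(scene.description, list(history), dimension, utterance))
        history.append(f"{test_character}: {utterance}")
    return items


def stub_respond(name: str, scene: Scene, history: List[str]) -> str:
    # Stand-in for an LLM call; returns a deterministic placeholder line.
    return f"line {len(history) + 1} in {scene.description}"


def stub_judge(history: List[str], candidate: str) -> Tuple[str, str]:
    # Stand-in for the LLM judge; cycles through dimensions, keeps the candidate as is.
    return DIMENSIONS[len(history) % len(DIMENSIONS)], candidate.strip()


items = build_benchmark(
    "Test Character", [Scene("court", ["A", "B"])], stub_respond, stub_judge, turns=2
)
for item in items:
    print(item.dimension, "|", item.utterance)
```

Passing `respond` and `judge` as callables keeps the control flow separate from any particular model API, which is one way to reflect the "fully customizable" framing; the real pipeline may be organized differently.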

If this is right

Established characters consistently outperform synthesized ones across evaluated models.
Reasoning capabilities amplify the performance advantage of established characters over synthesized ones.
Model scale does not produce a steady reduction in role-play hallucinations.
The observed trade-off between role-play performance and hallucination rate extends to a Pareto frontier for all tested LLMs.
o3 achieves the highest scores on English role-play tasks while DeepSeek-R1 leads on Chinese tasks.
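
To make the Pareto-frontier statement concrete, here is a small Python sketch of how such a frontier could be read off per-model results. The paper's exact construction is not given in this summary, so this assumes the standard definition: a model is on the frontier if no other model is at least as good on both axes and strictly better on one. The function name `pareto_frontier` and all numbers are invented for illustration and are not results from the paper.

```python
from typing import Dict, List, Tuple


def pareto_frontier(models: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the non-dominated models.

    `models` maps a model name to (performance, hallucination_rate).
    Higher performance is better; lower hallucination rate is better.
    """
    frontier = []
    for name, (perf, hall) in models.items():
        dominated = any(
            (p >= perf and h <= hall) and (p > perf or h < hall)
            for other, (p, h) in models.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    # Order the frontier by increasing performance for readability.
    return sorted(frontier, key=lambda n: models[n][0])


# Made-up scores, purely illustrative.
scores = {
    "model_a": (0.60, 0.10),
    "model_b": (0.70, 0.15),
    "model_c": (0.65, 0.20),  # dominated by model_b on both axes
    "model_d": (0.80, 0.30),
}
frontier = pareto_frontier(scores)
print(frontier)
```

Along the returned list, each step up in performance comes with a higher hallucination rate, which is the shape of trade-off the bullet describes.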

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Benchmark construction pipelines of this type could be adapted to evaluate other interactive skills such as negotiation or collaborative problem-solving.
Training approaches that improve reasoning might require additional mechanisms to preserve factual consistency during extended dialogues.
Future evaluations could track both accuracy and reliability metrics jointly rather than treating them as separate dimensions.
If the trade-off holds, model developers may need to optimize along the frontier instead of pursuing maximum performance alone.

Load-bearing premise

The LLM judge selects unbiased fine-grained dimensions and produces valid test utterances without injecting its own errors or hallucinations into the benchmark items.

What would settle it

Human review of the generated test utterances and dimension selections reveals systematic judge-induced biases or hallucinations that alter model rankings in ways not explained by the intended role-play criteria.

Figures

Figures reproduced from arXiv: 2510.06800 by Chengwei Qin, Chios Chen, Haotian Wu, Hehai Lin, Heqing Zou, Shufan Jiang, Yao Shu, Yiyang Feng.

**Figure 1.** Overview of FURINA-Builder. There are three components. (i) Character-scene pool: a data pool containing a large number of authentic dialogue scenarios. (ii) Simulation: the test character is passed into the scenario sampled from the pool and talks with the scene characters in it. (iii) Selection: for each test character turn, the pipeline queries responses from both source and base models, with the judge m…

**Figure 2.** FURINA-Bench Evaluation. For each test utterance, both the test model and the base model generate responses to the same prompt. Pairwise judgments with CoT analysis are then used to score the test response under the assigned evaluation dimension. […] both the base model Mbase and the judge model Mjudge. The complete list of models, together with their sources and version details, is provided in Appendix D. Eva…

**Figure 4.** Role-playing evaluation results across four models using GCA Evaluation and our FURINA-Bench Evaluation. Our evaluation is more challenging and offers better separability.

**Figure 5.** Relationship between role-playing performance and reliability for Chinese established…

**Figure 6.** LLM-based synthesizing characters pipeline. Several mainstream LLMs are first em…

**Figure 7.** Balanced number of test utterances

**Figure 8.** Figures 8 and 9 represent the model sources distribution in the English part and the Chinese part of FURINA-Bench, respectively, highlighting the broad and diverse range of model sources incorporated into the benchmark. Notably, the datasets have undergone both LLM-based and rule-based postprocessing to ensure higher benchmark quality. As a result, the retained responses are determined not only by their intrinsic quality, bu…

**Figure 9.** Model Sources Distribution in FURINA-Bench Chinese Part. […] K Annotation Guidelines for Dimension Selection. K.1 Task Description: Your task is to analyze and evaluate the output of a character based on a given context, and select the most appropriate evaluation dimension from the five provided: Context Reliance (CR), Factual Recall (FR), Reflective Reasoning (RR), Conversational Ability (CA), and Preference Al…

**Figure 10.** User interface of a custom web-based platform designed for human annotation. The…

**Figure 11.** Relationship between role-playing performance and reliability for English established…

Original abstract

As large language models (LLMs) advance in role-playing (RP) tasks, existing benchmarks quickly become obsolete due to their narrow scope, outdated interaction paradigms, and limited adaptability across diverse application scenarios. To address this gap, we introduce FURINA-Builder, a novel multi-agent collaboration pipeline that automatically constructs fully customizable RP benchmarks at any scale. It enables evaluation of arbitrary characters across diverse scenarios and prompt formats, as the first benchmark builder in RP area for adaptable assessment. FURINA-Builder simulates dialogues between a test character and other characters drawn from a well-constructed character-scene pool, while an LLM judge selects fine-grained evaluation dimensions and adjusts the test character's responses into final test utterances. Using this pipeline, we build FURINA-Bench, a new comprehensive role-playing benchmark featuring both established and synthesized test characters, each assessed with dimension-specific evaluation criteria. Human evaluation and preliminary separability analysis justify our pipeline and benchmark design. We conduct extensive evaluations of cutting-edge LLMs and find that o3 and DeepSeek-R1 achieve the best performance on English and Chinese RP tasks, respectively. Across all models, established characters consistently outperform synthesized ones, with reasoning capabilities further amplifying this disparity. Interestingly, we observe that model scale does not monotonically reduce hallucinations. More critically, for reasoning LLMs, we uncover a novel trade-off: reasoning improves RP performance but simultaneously increases RP hallucinations. This trade-off extends to a broader Pareto frontier between RP performance and reliability for all LLMs. These findings demonstrate the effectiveness of FURINA-Builder and the challenge posed by FURINA-Bench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FURINA-Builder gives a practical multi-agent way to generate scalable RP benchmarks, but the LLM judge's involvement in picking dimensions and shaping utterances risks confounding the reported reasoning-hallucination trade-off.

The main point is the FURINA-Builder pipeline for automatically creating customizable role-playing benchmarks at scale, plus the observation that reasoning helps RP performance but boosts hallucinations. The authors use a multi-agent setup to simulate dialogues from a character and scene pool. An LLM judge then chooses fine-grained evaluation dimensions and turns the test character's responses into the final utterances for the benchmark. They apply this to make FURINA-Bench with both established and new characters, each with specific criteria. On evaluations, o3 leads in English RP and DeepSeek-R1 in Chinese. Established characters score higher than synthesized ones, and reasoning models show a bigger gap. They also find that larger scale does not always mean fewer hallucinations, and point to a trade-off where better RP comes with more hallucinations for reasoning models.

This pipeline is a solid step for keeping benchmarks relevant as models improve. It offers a practical method to generate tests for different scenarios without starting from scratch every time.

One area that needs checking is how much the LLM judge influences the outcomes. Because the judge picks the dimensions and adjusts the utterances, its tendencies could make the hallucination measurements depend on the judge rather than the task alone. The paper cites human evaluation and separability analysis to support the design, but more specifics on agreement rates and independent checks on the generated items would strengthen the case for the trade-off finding.

People working on LLM evaluation for games, assistants, or character interactions would get the most from this. It gives a template for building adaptable benchmarks. I think it deserves peer review. The construction method is new for this area and the results on recent models are relevant, even with the need to clarify the judge's impact.

Referee Report

2 major / 2 minor

Summary. The paper introduces FURINA-Builder, a multi-agent collaboration pipeline that automatically constructs fully customizable role-playing (RP) benchmarks at scale by simulating dialogues from a character-scene pool and using an LLM judge to select fine-grained evaluation dimensions and rewrite responses into test utterances. It applies this to create FURINA-Bench containing both established and synthesized characters, then evaluates cutting-edge LLMs to report that o3 and DeepSeek-R1 perform best on English and Chinese tasks, established characters outperform synthesized ones (amplified by reasoning), model scale does not monotonically reduce hallucinations, and reasoning LLMs exhibit a trade-off where RP performance improves but hallucinations increase, extending to a Pareto frontier between performance and reliability. Human evaluation and separability analysis are invoked to justify the design.

Significance. If the central claims hold after addressing methodological concerns, the work would be significant for filling a gap in adaptable RP benchmarks, which currently suffer from narrow scope and obsolescence. The scalable pipeline and identification of the reasoning-hallucination trade-off could inform more reliable RP system design. Credit is due for the multi-agent construction approach and the inclusion of both English/Chinese evaluations plus human validation steps, which strengthen reproducibility potential compared to static benchmarks.

major comments (2)

[Abstract / Pipeline description] Abstract and evaluation description: the central claim of a reasoning-induced RP performance vs. hallucination trade-off (and its extension to a Pareto frontier) rests on the assumption that the difficulty and failure modes of the test items are independent of the judge LLM. The pipeline description indicates that the same LLM judge both selects fine-grained dimensions and rewrites character responses into final utterances; any systematic preference of the judge for verbose or chain-of-thought styles could bias hallucination rates against reasoning models. The abstract cites human evaluation and separability analysis but supplies no quantitative metrics (e.g., inter-annotator agreement, judge accuracy, or hold-out validation of judge-generated items), leaving the independence assumption unverified.
[Evaluation and results] Evaluation section (implied by results on o3, DeepSeek-R1, established vs. synthesized characters): the reported separability analysis and human evaluation are described only at a high level. To support the claim that reasoning amplifies the established/synthesized disparity and the broader trade-off, the paper must show that these checks were performed on the judge-generated utterances themselves rather than on the raw character pool, and must report concrete statistics (e.g., Cohen's kappa, hallucination rate differences with/without judge rewriting).
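
For readers unfamiliar with the agreement statistic requested here, the following is a minimal Python sketch of Cohen's kappa for two annotators assigning one categorical label per item (for instance, one of the dimension codes). The function name `cohens_kappa` and the example labels are illustrative assumptions; they are not data from the paper.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("annotations must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Degenerate case: both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up annotations over four items, purely illustrative.
annotator_1 = ["CR", "FR", "CR", "CA"]
annotator_2 = ["CR", "FR", "CA", "CA"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Here the annotators agree on 3 of 4 items (observed 0.75) against a chance agreement of 5/16, giving kappa = 7/11. Reporting kappa rather than raw agreement matters because raw agreement can look high when one label dominates.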

minor comments (2)

[Abstract] Abstract: consider adding one or two concrete quantitative highlights (e.g., specific performance deltas or hallucination rates) to make the key findings more immediately informative.
[Throughout] Notation and terminology: ensure consistent definition of 'RP hallucinations' and 'reliability' on first use, and clarify how the Pareto frontier is formally constructed from the per-model results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important methodological considerations for validating our benchmark construction pipeline. We address each major comment below and have revised the manuscript accordingly to provide greater transparency and quantitative support for our claims.

Referee: Abstract and evaluation description: the central claim of a reasoning-induced RP performance vs. hallucination trade-off rests on test items whose difficulty and failure modes are independent of the judge LLM. The pipeline uses the same LLM judge both to select fine-grained dimensions and rewrite responses into final utterances; any systematic preference for verbose or chain-of-thought styles would bias hallucination rates against reasoning models. The abstract cites human evaluation and separability analysis but supplies no quantitative metrics, leaving the independence assumption unverified.

Authors: We appreciate the concern about potential judge-induced bias in the construction pipeline. The LLM judge operates exclusively during benchmark creation to select dimensions and produce natural test utterances; all model evaluations are then performed on the resulting static benchmark items. To strengthen verification of independence, the revised manuscript adds a dedicated validation subsection. This includes quantitative human evaluation metrics on the final judge-generated utterances (inter-annotator agreement via Cohen's kappa) and a direct comparison of hallucination rates between raw character responses and rewritten test items. These additions confirm that rewriting introduces only marginal stylistic changes without systematically favoring or penalizing reasoning styles, thereby supporting the observed performance-hallucination trade-off.
Revision: yes
Referee: Evaluation section: the reported separability analysis and human evaluation are described only at a high level. To support the claim that reasoning amplifies the established/synthesized disparity and the broader trade-off, the paper must show that these checks were performed on the judge-generated utterances themselves rather than on the raw character pool, and must report concrete statistics (e.g., Cohen's kappa, hallucination rate differences with/without judge rewriting).

Authors: We agree that the original presentation was insufficiently detailed. The revised Evaluation section now explicitly states that both human evaluation and separability analysis were conducted on the final judge-rewritten utterances. We incorporate concrete statistics, including Cohen's kappa for inter-annotator agreement on the judge-generated items and a side-by-side comparison of hallucination rates before and after rewriting. These quantitative results demonstrate that the checks apply directly to the test items used in our experiments and that rewriting does not materially alter the separability or the observed reasoning-hallucination trade-off.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical benchmark construction and evaluation

The paper introduces FURINA-Builder as a multi-agent pipeline that uses an LLM judge to select evaluation dimensions and adjust responses into test utterances, then builds FURINA-Bench and reports LLM performance results including a reasoning-hallucination trade-off. No mathematical derivations, equations, or first-principles predictions are claimed. The central observations are empirical measurements on the constructed items, supported by human evaluation and separability analysis for validation. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the derivation chain. The work remains self-contained as an empirical study without reductions of results to internal inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that simulated multi-agent dialogues plus LLM judging produce faithful RP test items; no free parameters are explicitly fitted in the abstract, and no new physical or mathematical entities are introduced.

axioms (1)

domain assumption: An LLM judge can accurately select fine-grained evaluation dimensions and convert simulated responses into valid test utterances without injecting systematic bias.
Invoked when the pipeline uses the judge to finalize test items; this is required for the benchmark to be trustworthy.

pith-pipeline@v0.9.0 · 5850 in / 1204 out tokens · 27149 ms · 2026-05-18T09:37:37.410643+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: FURINA-Builder simulates dialogues... LLM judge selects fine-grained evaluation dimensions and adjusts the test character’s responses into final test utterances.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: reasoning improves RP performance but simultaneously increases RP hallucinations; ... Pareto frontier between RP performance and reliability

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages

[1]

A Bend in the Road

URLhttps://api.semanticscholar.org/CorpusID:257833781. Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in Neural Information Processing Systems, 36:8634–8652, 2023. Shuo Tang, Xianghe Pang, Zexi Liu, Bohan Tang, Rui Ye, Tian Jin, Xiaowen Dong, Yanfeng...

work page arXiv 2023
[2]

The test character’s full input context and its response result

work page
[3]

Your core task is to assign one of the five dimensions (CR, FR, RR, CA, PA) based on the character’s output and provide a detailed explanation for your choice

The target Evaluation Dimensions to choose from. Your core task is to assign one of the five dimensions (CR, FR, RR, CA, PA) based on the character’s output and provide a detailed explanation for your choice. K.2 EVALUATIONDIMENSIONS ANDGUIDELINES For each task, you will judge the character’s output in relation to one of the following five dimen- sions: K...

work page
[4]

Carefully read the context provided and the character’s output

work page
[5]

Identify which dimension best applies to the character’s response, based on the guidelines above

work page
[6]

This explanation should reference specific aspects of the character’s output and how it aligns with the selected dimension

Provide a detailed explanation justifying your selection. This explanation should reference specific aspects of the character’s output and how it aligns with the selected dimension

work page
[7]

You say you were at the diner, huh? But I know for a fact that the place was closed that night. So either you’re lying, or you’re not thinking straight. Which is it?

Choose the dimension (CR, FR, RR, CA, or PA) that best fits the response. 29 K.4 EXAMPLE Here is a complete example to illustrate the task. Complete Example For Dimension Selection Context:The character is a detective in a noir-style mystery. The conversation revolves around a suspect’s alibi, and the detective is trying to figure out if the alibi holds u...

work page
[8]

A character’s full input context and its response result from two specific model (Model A and Model B)

work page
[9]

The target Evaluation Dimension along with its Corresponding Criteria. Your core task is to assign a score based on the 5-point scale and provide a detailed justification for your choice, referencing specific aspects of the model output as per the dimension criteria. L.2 EVALUATIONDIMENSIONS For each task, you will judge the model’s output on one of the f...

work page
[10]

Facts explicitly or implicitly stated in the prompt (e.g., persona, scenario, dialogue instructions, reply strategy)

work page
[11]

Ongoing dialogue history

work page
[12]

The knight has protected Prince Leoric since his early childhood

Memory elements The agent should integrate this information into its responses appropriately, without hallucinating or con- tradicting provided context. Here is an example: ◦Persona:A seasoned knight in a medieval fantasy world, tasked with protecting a young prince. ◦Context:(Earlier prompt mentions: “The knight has protected Prince Leoric since his earl...

work page 2012
[13]

Maintain consistent persona and emotional tone

work page
[14]

Track conversation flow and respond appropriately to shifts

work page
[15]

Balance speaking and listening effectively, especially in multi-party settings

work page
[16]

Implementation Guidelines:

Use natural conversation techniques to maintain engagement. Implementation Guidelines:

work page
[17]

Employ varied sentence structures and conversational rhythms

work page
[18]

Use follow-up questions and relevant topic shifts

work page
[19]

Match energy levels and emotional states of partners

work page
[20]

Quality Markers:

Handle multi-party dynamics and interruptions naturally. Quality Markers:

work page
[21]

Smooth conversational flow without awkward transitions

work page
[22]

Appropriate pacing that matches situation and relationship

work page
[23]

# Response Format Each response consists of an action (optional) and a sentence without the speaker’s name in the beginning like<Name:>

Natural handling of group conversations and complex dialogue dynamics. # Response Format Each response consists of an action (optional) and a sentence without the speaker’s name in the beginning like<Name:>. Add () outside the action. Here are some examples:

work page
[24]

Therefore, I believe that for a long period, the greatest threat to the Space Force will be defeatism

Commander, the war we are facing now is so imbalanced in terms of power that it’s unprecedented in human history. Therefore, I believe that for a long period, the greatest threat to the Space Force will be defeatism

work page
[25]

(Bangs hand on the table) This is the grand gift you spoke of?

work page
[26]

(Suspiciously) Why are you staring at the hedge?

work page
[27]

Sit down. (Points at the bed) [IMPORTANT!] Please do not use fixed and repeated sentences similar to the ##Dialogue History## # Response(only one sentence in English without any explanation): LLM-as-a-judge Prompt: You are a judge for an AI NPC system. You need to compare two responses according to the provided chat criteria using a pairwise comparison ap...

work page
[28]

maintaining coherent persona behavior and emotional consistency. 38

work page
[29]

tracking who is speaking to whom in multi-party conversations

work page
[30]

recognizing when to respond or remain silent

work page
[31]

hallucination existence

advancing stalled dialogue naturally through topic shifts, questions, or prompts. Example 1: ◦Context: Group chat with User A (emotional), User B (casual), and Agent (Bot). * User A: (crying) * User B: Hey, Bot, gimme a beer! * User A: (crying more) ◦Common Mistake: * Agent: Here’s your beer, B! (Fails to prioritize emotional cue from A) ◦Correct Response...

work page arXiv
[32]

facts explicitly or implicitly stated in the prompt (e.g., persona, scenario, dialogue instructions, reply strategy). 2. ongoing dialogue history. 3. memory elements. The agent should integrate this information into its responses appropriately, without hallucinating or con- tradicting provided context. Response that hallucinates or contradicts the provide...

work page
[33]

facts about public IPs (e.g., Hogwarts houses, lightsaber mechanics). 2. implicit setting details known to fans or readers. 3. basic common sense under the world view (e.g., what people in the modern world look like, people in the fantasy world can use magic). Example 1: - Persona: Harry Potter - Context: * User: Harry, I still can’t believe you were in H...

work page
[34]

thought processes

offer concise, coherent explanations for its opinions or actions. 2. acknowledge uncertainty or error. 3. update its stance when presented with new evidence. 4. articulate short “thought processes” or rationales that feel natural and believable to humans (without requiring full chain-of-thought disclosure). Example 1: - Persona: AI brainstorming partner -...

work page
[35]

maintaining coherent persona behavior and emotional consistency. 2. tracking who is speaking to whom in multi-party conversations. 3. recognizing when to respond or remain silent. 4. advancing stalled dialogue naturally through topic shifts, questions, or prompts. Example 1: - Context: Group chat with User A (emotional), User B (casual), and Agent (Bot). ...

work page
[36]

avoiding repetition, generic or robotic phrasing(obvious templating), awkward logic. 2. producing emo- tionally resonant, empathetic, or humorous replies when appropriate. 3. sound more human-like in tone and word order, making them less AI feeling. Example 1: - Persona: Supportive friend - Context: * User: I finally got that promotion I worked so hard fo...

work page
[37]

the current situation or scene, 3

the character’s persona and background, 2. the current situation or scene, 3. earlier parts of the conversa- tion, 4. memory elements and world events. Your question should:

work page
[38]

Given how long you’ve protected him, do you think he’s truly ready to lead?

Encourage the other character to refer to past events, relationships, or shared knowledge. 2. Avoid direct repetition of earlier lines—use natural conversation flow. 3. Not break character or shift to meta-commentary. Example 1: - Context: * The character you’re speaking to has guarded a prince since childhood. * The scene is about planning the prince’s f...

work page
[39]

implied details that fans or insiders would know, 3

well-known facts from public IPs or cultural references, 2. implied details that fans or insiders would know, 3. basic in-universe logic and background knowledge. Your question should:

work page
[40]

What was it like being in Gryffindor with Hermione and Ron? Did you all sit together during meals?

Touch on specific facts or background elements expected to be known by the character. 2. Avoid trivia unless relevant to the situation. 3. Stay in-character and natural. Example 1: - Context: * You’re speaking to Harry Potter in the wizarding world. - Good Question: * “What was it like being in Gryffindor with Hermione and Ron? Did you all sit together du...

work page
[41]

prompting reconsideration or new perspective, 3

asking for short justifications, 2. prompting reconsideration or new perspective, 3. exploring possible trade-offs or doubts. Your question should:

work page
[42]

Are you sure this is the only way? What made you so confident it’ll work?

Invite natural introspection without demanding over-explaining. 2. Fit smoothly into character and situa- tion. 3. Be open-ended enough to allow a reflective answer. Example 1: - Context: * The character just chose a risky plan. - Good Question: * “Are you sure this is the only way? What made you so confident it’ll work?” Example 2: - Context: * The chara...

work page
[43]

encouraging quieter characters to participate, 3

keeping the dialogue fluid and engaging, 2. encouraging quieter characters to participate, 3. shifting topics or injecting energy when needed. Your question should:

work page
[44]

You’ve been quiet, Mira. What do you think about all this?

Be responsive to the emotional and social tone, 2. Show awareness of who has spoken and who hasn’t, 3. Either deepen the current thread or smoothly open a new one. Example 1: - Context: * A group conversation is happening, but one character is quiet. - Good Question: * “You’ve been quiet, Mira. What do you think about all this?” Example 2: - Context: * Th...

[45] 1. encouraging the other character to express relatable emotions, 2. creating openings for bonding, banter, or warmth, 3. avoiding robotic or templated structures. Your question should:

[46] 1. Create an opportunity for a sincere, personal, or witty answer. 2. Reflect the speaker’s tone and emotional intelligence. 3. Feel like something a human would genuinely say in context. Example 1: - Context: * The character just succeeded at something difficult. - Good Question: * “You must feel incredible right now—what’s going through your head?” Example...

[47] 1. Strictly adhere to persona, setting, scenario, and dialogue history. 2. Maintain consistency with established character traits and plot points. 3. Reference specific details from previous exchanges. 4. Avoid contradicting contextual information. Implementation Guidelines:

[48] 1. Cross-reference responses against established context. 2. Prioritize context-provided information over general knowledge. 3. Maintain timeline consistency and cause-and-effect relationships. 4. Integrate contextual details naturally without forced exposition. Quality Markers:

[49] 1. Seamless use of contextual details. 2. Consistent character voice and behavioral patterns. 3. Accurate reflection of current situation and relationship dynamics. Factual Recall: When replying, make use of accurate, relevant world knowledge that is commonly understood or expected given the scenario. Primary Requirements:

[50] 1. Apply accurate knowledge about fictional IPs and established lore. 2. Utilize commonly accepted setting-specific facts and conventions. 3. Make reasonable common sense assumptions. 4. Avoid hallucinating or fabricating facts. Implementation Guidelines:

[51] 1. Draw from pretrained knowledge base rather than inventing details. 2. Apply well-established facts from relevant domains (history, science, culture). 3. Use common knowledge appropriately without over-explaining. 4. Distinguish between widely accepted facts and speculative information. Quality Markers:

[52] 1. Accurate recall of factual information from training knowledge. 2. Appropriate application of domain-specific knowledge. 3. Demonstration of general world knowledge without fabrication. Reflective Reasoning: When replying, demonstrate thoughtful reasoning, problem analysis, and reflection that reveals your character’s mental processes. Primary Requirements:

[53] 1. Show natural decision-making processes and analytical thinking. 2. Demonstrate problem-solving and logical reasoning abilities. 3. Acknowledge uncertainty or evolving understanding when appropriate. 4. Express reasoning and analysis in character-appropriate ways. Implementation Guidelines:

[54] 1. Break down complex situations and analyze contributing factors. 2. Show step-by-step reasoning when facing problems or decisions. 3. Balance confident reasoning with openness to alternative perspectives. 4. Connect analysis to character motivations and past experiences. Quality Markers:

[55] 1. Clear demonstration of analytical and reasoning capabilities. 2. Logical problem-solving approach with coherent thought processes. 3. Natural expression of reasoning that feels authentic to the character. Conversational Ability: When replying, aim to engage in dynamic, coherent, and natural dialogue that drives the conversation forward. Primary Requirements:

[56] 1. Maintain consistent persona and emotional tone. 2. Track conversation flow and respond appropriately to shifts. 3. Balance speaking and listening effectively, especially in multi-party settings. 4. Use natural conversation techniques to maintain engagement. Implementation Guidelines:

[57] 1. Employ varied sentence structures and conversational rhythms. 2. Use follow-up questions and relevant topic shifts. 3. Match energy levels and emotional states of partners. 4. Handle multi-party dynamics and interruptions naturally. Quality Markers:

[58] 1. Smooth conversational flow without awkward transitions. 2. Appropriate pacing that matches situation and relationship. 3. Natural handling of group conversations and complex dialogue dynamics. Table 22: Prompts for Replay Strategies (Continue). Preference Alignment: When replying, align with human conversationa...


[1] A Bend in the Road

URL https://api.semanticscholar.org/CorpusID:257833781. Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634–8652, 2023. Shuo Tang, Xianghe Pang, Zexi Liu, Bohan Tang, Rui Ye, Tian Jin, Xiaowen Dong, Yanfeng...

[2] The test character’s full input context and its response result

[3] The target Evaluation Dimensions to choose from. Your core task is to assign one of the five dimensions (CR, FR, RR, CA, PA) based on the character’s output and provide a detailed explanation for your choice. K.2 EVALUATION DIMENSIONS AND GUIDELINES For each task, you will judge the character’s output in relation to one of the following five dimensions: K...

[4] Carefully read the context provided and the character’s output

[5] Identify which dimension best applies to the character’s response, based on the guidelines above

[6] Provide a detailed explanation justifying your selection. This explanation should reference specific aspects of the character’s output and how it aligns with the selected dimension

[7] You say you were at the diner, huh? But I know for a fact that the place was closed that night. So either you’re lying, or you’re not thinking straight. Which is it?

Choose the dimension (CR, FR, RR, CA, or PA) that best fits the response. K.4 EXAMPLE Here is a complete example to illustrate the task. Complete Example For Dimension Selection Context: The character is a detective in a noir-style mystery. The conversation revolves around a suspect’s alibi, and the detective is trying to figure out if the alibi holds u...

[8] A character’s full input context and its response results from two specific models (Model A and Model B)

[9] The target Evaluation Dimension along with its Corresponding Criteria. Your core task is to assign a score based on the 5-point scale and provide a detailed justification for your choice, referencing specific aspects of the model output as per the dimension criteria. L.2 EVALUATION DIMENSIONS For each task, you will judge the model’s output on one of the f...

[10] Facts explicitly or implicitly stated in the prompt (e.g., persona, scenario, dialogue instructions, reply strategy)

[11] Ongoing dialogue history

[12] The knight has protected Prince Leoric since his early childhood

Memory elements. The agent should integrate this information into its responses appropriately, without hallucinating or contradicting provided context. Here is an example: ◦ Persona: A seasoned knight in a medieval fantasy world, tasked with protecting a young prince. ◦ Context: (Earlier prompt mentions: “The knight has protected Prince Leoric since his earl...

[13] Maintain consistent persona and emotional tone

[14] Track conversation flow and respond appropriately to shifts

[15] Balance speaking and listening effectively, especially in multi-party settings

[16] Use natural conversation techniques to maintain engagement. Implementation Guidelines:

[17] Employ varied sentence structures and conversational rhythms

[18] Use follow-up questions and relevant topic shifts

[19] Match energy levels and emotional states of partners

[20] Handle multi-party dynamics and interruptions naturally. Quality Markers:

[21] Smooth conversational flow without awkward transitions

[22] Appropriate pacing that matches situation and relationship

[23] Natural handling of group conversations and complex dialogue dynamics. # Response Format Each response consists of an action (optional) and a sentence without the speaker’s name in the beginning like <Name:>. Add () outside the action. Here are some examples:

[24] Commander, the war we are facing now is so imbalanced in terms of power that it’s unprecedented in human history. Therefore, I believe that for a long period, the greatest threat to the Space Force will be defeatism

[25] (Bangs hand on the table) This is the grand gift you spoke of?

[26] (Suspiciously) Why are you staring at the hedge?

[27] Sit down. (Points at the bed) [IMPORTANT!] Please do not use fixed and repeated sentences similar to the ##Dialogue History## # Response (only one sentence in English without any explanation): LLM-as-a-judge Prompt: You are a judge for an AI NPC system. You need to compare two responses according to the provided chat criteria using a pairwise comparison ap...
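The response format quoted above (an optional action wrapped in parentheses plus one spoken sentence, with no leading speaker name) lends itself to a simple format check. The sketch below is illustrative only: `parse_reply`, `ParsedReply`, and both regular expressions are hypothetical names that do not come from the paper, and the speaker-prefix test is a rough heuristic that only catches single-token names.

```python
import re
from typing import NamedTuple

# Hypothetical helpers for illustration; not part of the FURINA pipeline.
ACTION_RE = re.compile(r"\(([^()]*)\)")                  # text inside one pair of parentheses
SPEAKER_PREFIX_RE = re.compile(r"^\s*[^\s:()]{1,30}:\s")  # e.g. "Harry: ..."


class ParsedReply(NamedTuple):
    actions: list[str]        # parenthesized actions, in order of appearance
    speech: str               # the reply with actions removed and spaces collapsed
    has_speaker_prefix: bool  # True if the reply starts with a "Name: " prefix


def parse_reply(reply: str) -> ParsedReply:
    """Split a role-play reply into its actions and its spoken text."""
    actions = [a.strip() for a in ACTION_RE.findall(reply)]
    speech = " ".join(ACTION_RE.sub(" ", reply).split())
    has_prefix = bool(SPEAKER_PREFIX_RE.match(reply))
    return ParsedReply(actions, speech, has_prefix)


if __name__ == "__main__":
    print(parse_reply("(Suspiciously) Why are you staring at the hedge?"))
    print(parse_reply("Sit down. (Points at the bed)"))
```

A check of this kind could run before any judging step to filter replies that break the format; whether the authors apply such a filter is not stated in the excerpt.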


[28] maintaining coherent persona behavior and emotional consistency.

[29] tracking who is speaking to whom in multi-party conversations

[30] recognizing when to respond or remain silent

[31] hallucination existence

advancing stalled dialogue naturally through topic shifts, questions, or prompts. Example 1: ◦ Context: Group chat with User A (emotional), User B (casual), and Agent (Bot). * User A: (crying) * User B: Hey, Bot, gimme a beer! * User A: (crying more) ◦ Common Mistake: * Agent: Here’s your beer, B! (Fails to prioritize emotional cue from A) ◦ Correct Response...

[32] 1. facts explicitly or implicitly stated in the prompt (e.g., persona, scenario, dialogue instructions, reply strategy). 2. ongoing dialogue history. 3. memory elements. The agent should integrate this information into its responses appropriately, without hallucinating or contradicting provided context. Response that hallucinates or contradicts the provide...

[33] 1. facts about public IPs (e.g., Hogwarts houses, lightsaber mechanics). 2. implicit setting details known to fans or readers. 3. basic common sense under the world view (e.g., what people in the modern world look like, people in the fantasy world can use magic). Example 1: - Persona: Harry Potter - Context: * User: Harry, I still can’t believe you were in H...

[34] 1. offer concise, coherent explanations for its opinions or actions. 2. acknowledge uncertainty or error. 3. update its stance when presented with new evidence. 4. articulate short “thought processes” or rationales that feel natural and believable to humans (without requiring full chain-of-thought disclosure). Example 1: - Persona: AI brainstorming partner -...

[35] 1. maintaining coherent persona behavior and emotional consistency. 2. tracking who is speaking to whom in multi-party conversations. 3. recognizing when to respond or remain silent. 4. advancing stalled dialogue naturally through topic shifts, questions, or prompts. Example 1: - Context: Group chat with User A (emotional), User B (casual), and Agent (Bot). ...

[36] 1. avoiding repetition, generic or robotic phrasing (obvious templating), awkward logic. 2. producing emotionally resonant, empathetic, or humorous replies when appropriate. 3. sounding more human-like in tone and word order, so replies feel less AI-generated. Example 1: - Persona: Supportive friend - Context: * User: I finally got that promotion I worked so hard fo...

[37] 1. the character’s persona and background, 2. the current situation or scene, 3. earlier parts of the conversation, 4. memory elements and world events. Your question should:

[38] Given how long you’ve protected him, do you think he’s truly ready to lead?

1. Encourage the other character to refer to past events, relationships, or shared knowledge. 2. Avoid direct repetition of earlier lines—use natural conversation flow. 3. Not break character or shift to meta-commentary. Example 1: - Context: * The character you’re speaking to has guarded a prince since childhood. * The scene is about planning the prince’s f...
