FURINA: A Fully Customizable Role-Playing Benchmark via Scalable Multi-Agent Collaboration Pipeline
Pith reviewed 2026-05-18 09:37 UTC · model grok-4.3
The pith
A multi-agent pipeline automatically builds customizable role-playing benchmarks that expose a performance-reliability trade-off in LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FURINA-Builder constructs fully customizable role-playing benchmarks at any scale by simulating dialogues and letting an LLM judge define dimension-specific criteria and test utterances. Applied to both established and synthesized characters, the resulting benchmark shows o3 leading on English tasks and DeepSeek-R1 on Chinese tasks, with established characters outperforming synthesized ones and reasoning capabilities widening that gap. Across models the work identifies that reasoning improves role-play performance while simultaneously increasing hallucinations, extending to a Pareto frontier between performance and reliability.
What carries the argument
FURINA-Builder, the multi-agent collaboration pipeline that draws characters and scenes from a pool, simulates dialogues, and relies on an LLM judge to select evaluation dimensions and generate final test utterances.
If this is right
- Established characters consistently outperform synthesized ones across evaluated models.
- Reasoning capabilities amplify the performance advantage of established characters over synthesized ones.
- Model scale does not produce a steady reduction in role-play hallucinations.
- The observed trade-off between role-play performance and hallucination rate extends to a Pareto frontier for all tested LLMs.
- o3 achieves the highest scores on English role-play tasks while DeepSeek-R1 leads on Chinese tasks.
Where Pith is reading between the lines
- Benchmark construction pipelines of this type could be adapted to evaluate other interactive skills such as negotiation or collaborative problem-solving.
- Training approaches that improve reasoning might require additional mechanisms to preserve factual consistency during extended dialogues.
- Future evaluations could track both accuracy and reliability metrics jointly rather than treating them as separate dimensions.
- If the trade-off holds, model developers may need to optimize along the frontier instead of pursuing maximum performance alone.
Load-bearing premise
The LLM judge selects unbiased fine-grained dimensions and produces valid test utterances without injecting its own errors or hallucinations into the benchmark items.
What would settle it
Human review of the generated test utterances and dimension selections reveals systematic judge-induced biases or hallucinations that alter model rankings in ways not explained by the intended role-play criteria.
Figures
read the original abstract
As large language models (LLMs) advance in role-playing (RP) tasks, existing benchmarks quickly become obsolete due to their narrow scope, outdated interaction paradigms, and limited adaptability across diverse application scenarios. To address this gap, we introduce FURINA-Builder, a novel multi-agent collaboration pipeline that automatically constructs fully customizable RP benchmarks at any scale. It enables evaluation of arbitrary characters across diverse scenarios and prompt formats, as the first benchmark builder in RP area for adaptable assessment. FURINA-Builder simulates dialogues between a test character and other characters drawn from a well-constructed character-scene pool, while an LLM judge selects fine-grained evaluation dimensions and adjusts the test character's responses into final test utterances. Using this pipeline, we build FURINA-Bench, a new comprehensive role-playing benchmark featuring both established and synthesized test characters, each assessed with dimension-specific evaluation criteria. Human evaluation and preliminary separability analysis justify our pipeline and benchmark design. We conduct extensive evaluations of cutting-edge LLMs and find that o3 and DeepSeek-R1 achieve the best performance on English and Chinese RP tasks, respectively. Across all models, established characters consistently outperform synthesized ones, with reasoning capabilities further amplifying this disparity. Interestingly, we observe that model scale does not monotonically reduce hallucinations. More critically, for reasoning LLMs, we uncover a novel trade-off: reasoning improves RP performance but simultaneously increases RP hallucinations. This trade-off extends to a broader Pareto frontier between RP performance and reliability for all LLMs. These findings demonstrate the effectiveness of FURINA-Builder and the challenge posed by FURINA-Bench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FURINA-Builder, a multi-agent collaboration pipeline that automatically constructs fully customizable role-playing (RP) benchmarks at scale by simulating dialogues from a character-scene pool and using an LLM judge to select fine-grained evaluation dimensions and rewrite responses into test utterances. It applies this to create FURINA-Bench containing both established and synthesized characters, then evaluates cutting-edge LLMs to report that o3 and DeepSeek-R1 perform best on English and Chinese tasks, established characters outperform synthesized ones (amplified by reasoning), model scale does not monotonically reduce hallucinations, and reasoning LLMs exhibit a trade-off where RP performance improves but hallucinations increase, extending to a Pareto frontier between performance and reliability. Human evaluation and separability analysis are invoked to justify the design.
Significance. If the central claims hold after addressing methodological concerns, the work would be significant for filling a gap in adaptable RP benchmarks, which currently suffer from narrow scope and obsolescence. The scalable pipeline and identification of the reasoning-hallucination trade-off could inform more reliable RP system design. Credit is due for the multi-agent construction approach and the inclusion of both English/Chinese evaluations plus human validation steps, which strengthen reproducibility potential compared to static benchmarks.
major comments (2)
- [Abstract / Pipeline description] Abstract and evaluation description: the central claim of a reasoning-induced RP performance vs. hallucination trade-off (and its extension to a Pareto frontier) rests on test items whose difficulty and failure modes are independent of the judge LLM. The pipeline description indicates the same LLM judge both selects fine-grained dimensions and rewrites character responses into final utterances; any systematic preference of the judge for verbose or chain-of-thought styles would automatically bias hallucination rates against reasoning models. The abstract cites human evaluation and separability analysis but supplies no quantitative metrics (e.g., inter-annotator agreement, judge accuracy, or hold-out validation of judge-generated items), leaving the independence assumption unverified.
- [Evaluation and results] Evaluation section (implied by results on o3, DeepSeek-R1, established vs. synthesized characters): the reported separability analysis and human evaluation are described only at a high level. To support the claim that reasoning amplifies the established/synthesized disparity and the broader trade-off, the paper must show that these checks were performed on the judge-generated utterances themselves rather than on the raw character pool, and must report concrete statistics (e.g., Cohen's kappa, hallucination rate differences with/without judge rewriting).
minor comments (2)
- [Abstract] Abstract: consider adding one or two concrete quantitative highlights (e.g., specific performance deltas or hallucination rates) to make the key findings more immediately informative.
- [Throughout] Notation and terminology: ensure consistent definition of 'RP hallucinations' and 'reliability' on first use, and clarify how the Pareto frontier is formally constructed from the per-model results.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important methodological considerations for validating our benchmark construction pipeline. We address each major comment below and have revised the manuscript accordingly to provide greater transparency and quantitative support for our claims.
read point-by-point responses
-
Referee: Abstract and evaluation description: the central claim of a reasoning-induced RP performance vs. hallucination trade-off rests on test items whose difficulty and failure modes are independent of the judge LLM. The pipeline uses the same LLM judge both to select fine-grained dimensions and rewrite responses into final utterances; any systematic preference for verbose or chain-of-thought styles would bias hallucination rates against reasoning models. The abstract cites human evaluation and separability analysis but supplies no quantitative metrics, leaving the independence assumption unverified.
Authors: We appreciate the concern about potential judge-induced bias in the construction pipeline. The LLM judge operates exclusively during benchmark creation to select dimensions and produce natural test utterances; all model evaluations are then performed on the resulting static benchmark items. To strengthen verification of independence, the revised manuscript adds a dedicated validation subsection. This includes quantitative human evaluation metrics on the final judge-generated utterances (inter-annotator agreement via Cohen's kappa) and a direct comparison of hallucination rates between raw character responses and rewritten test items. These additions confirm that rewriting introduces only marginal stylistic changes without systematically favoring or penalizing reasoning styles, thereby supporting the observed performance-hallucination trade-off. revision: yes
-
Referee: Evaluation section: the reported separability analysis and human evaluation are described only at a high level. To support the claim that reasoning amplifies the established/synthesized disparity and the broader trade-off, the paper must show that these checks were performed on the judge-generated utterances themselves rather than on the raw character pool, and must report concrete statistics (e.g., Cohen's kappa, hallucination rate differences with/without judge rewriting).
Authors: We agree that the original presentation was insufficiently detailed. The revised Evaluation section now explicitly states that both human evaluation and separability analysis were conducted on the final judge-rewritten utterances. We incorporate concrete statistics, including Cohen's kappa for inter-annotator agreement on the judge-generated items and a side-by-side comparison of hallucination rates before and after rewriting. These quantitative results demonstrate that the checks apply directly to the test items used in our experiments and that rewriting does not materially alter the separability or the observed reasoning-hallucination trade-off. revision: yes
Circularity Check
No significant circularity in empirical benchmark construction and evaluation
full rationale
The paper introduces FURINA-Builder as a multi-agent pipeline that uses an LLM judge to select evaluation dimensions and adjust responses into test utterances, then builds FURINA-Bench and reports LLM performance results including a reasoning-hallucination trade-off. No mathematical derivations, equations, or first-principles predictions are claimed. The central observations are empirical measurements on the constructed items, supported by human evaluation and separability analysis for validation. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the derivation chain. The work remains self-contained as an empirical study without reductions of results to internal inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption An LLM judge can accurately select fine-grained evaluation dimensions and convert simulated responses into valid test utterances without injecting systematic bias.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
FURINA-Builder simulates dialogues... LLM judge selects fine-grained evaluation dimensions and adjusts the test character’s responses into final test utterances.
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
reasoning improves RP performance but simultaneously increases RP hallucinations; ... Pareto frontier between RP performance and reliability
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
URLhttps://api.semanticscholar.org/CorpusID:257833781. Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in Neural Information Processing Systems, 36:8634–8652, 2023. Shuo Tang, Xianghe Pang, Zexi Liu, Bohan Tang, Rui Ye, Tian Jin, Xiaowen Dong, Yanfeng...
-
[2]
The test character’s full input context and its response result
-
[3]
The target Evaluation Dimensions to choose from. Your core task is to assign one of the five dimensions (CR, FR, RR, CA, PA) based on the character’s output and provide a detailed explanation for your choice. K.2 EVALUATIONDIMENSIONS ANDGUIDELINES For each task, you will judge the character’s output in relation to one of the following five dimen- sions: K...
-
[4]
Carefully read the context provided and the character’s output
-
[5]
Identify which dimension best applies to the character’s response, based on the guidelines above
-
[6]
Provide a detailed explanation justifying your selection. This explanation should reference specific aspects of the character’s output and how it aligns with the selected dimension
-
[7]
Choose the dimension (CR, FR, RR, CA, or PA) that best fits the response. 29 K.4 EXAMPLE Here is a complete example to illustrate the task. Complete Example For Dimension Selection Context:The character is a detective in a noir-style mystery. The conversation revolves around a suspect’s alibi, and the detective is trying to figure out if the alibi holds u...
-
[8]
A character’s full input context and its response result from two specific model (Model A and Model B)
-
[9]
The target Evaluation Dimension along with its Corresponding Criteria. Your core task is to assign a score based on the 5-point scale and provide a detailed justification for your choice, referencing specific aspects of the model output as per the dimension criteria. L.2 EVALUATIONDIMENSIONS For each task, you will judge the model’s output on one of the f...
-
[10]
Facts explicitly or implicitly stated in the prompt (e.g., persona, scenario, dialogue instructions, reply strategy)
-
[11]
Ongoing dialogue history
-
[12]
The knight has protected Prince Leoric since his early childhood
Memory elements The agent should integrate this information into its responses appropriately, without hallucinating or con- tradicting provided context. Here is an example: ◦Persona:A seasoned knight in a medieval fantasy world, tasked with protecting a young prince. ◦Context:(Earlier prompt mentions: “The knight has protected Prince Leoric since his earl...
work page 2012
-
[13]
Maintain consistent persona and emotional tone
-
[14]
Track conversation flow and respond appropriately to shifts
-
[15]
Balance speaking and listening effectively, especially in multi-party settings
-
[16]
Use natural conversation techniques to maintain engagement. Implementation Guidelines:
-
[17]
Employ varied sentence structures and conversational rhythms
-
[18]
Use follow-up questions and relevant topic shifts
-
[19]
Match energy levels and emotional states of partners
-
[20]
Handle multi-party dynamics and interruptions naturally. Quality Markers:
-
[21]
Smooth conversational flow without awkward transitions
-
[22]
Appropriate pacing that matches situation and relationship
-
[23]
Natural handling of group conversations and complex dialogue dynamics. # Response Format Each response consists of an action (optional) and a sentence without the speaker’s name in the beginning like<Name:>. Add () outside the action. Here are some examples:
-
[24]
Commander, the war we are facing now is so imbalanced in terms of power that it’s unprecedented in human history. Therefore, I believe that for a long period, the greatest threat to the Space Force will be defeatism
-
[25]
(Bangs hand on the table) This is the grand gift you spoke of?
-
[26]
(Suspiciously) Why are you staring at the hedge?
-
[27]
Sit down. (Points at the bed) [IMPORTANT!] Please do not use fixed and repeated sentences similar to the ##Dialogue History## # Response(only one sentence in English without any explanation): LLM-as-a-judge Prompt: You are a judge for an AI NPC system. You need to compare two responses according to the provided chat criteria using a pairwise comparison ap...
-
[28]
maintaining coherent persona behavior and emotional consistency. 38
-
[29]
tracking who is speaking to whom in multi-party conversations
-
[30]
recognizing when to respond or remain silent
-
[31]
advancing stalled dialogue naturally through topic shifts, questions, or prompts. Example 1: ◦Context: Group chat with User A (emotional), User B (casual), and Agent (Bot). * User A: (crying) * User B: Hey, Bot, gimme a beer! * User A: (crying more) ◦Common Mistake: * Agent: Here’s your beer, B! (Fails to prioritize emotional cue from A) ◦Correct Response...
-
[32]
facts explicitly or implicitly stated in the prompt (e.g., persona, scenario, dialogue instructions, reply strategy). 2. ongoing dialogue history. 3. memory elements. The agent should integrate this information into its responses appropriately, without hallucinating or con- tradicting provided context. Response that hallucinates or contradicts the provide...
-
[33]
facts about public IPs (e.g., Hogwarts houses, lightsaber mechanics). 2. implicit setting details known to fans or readers. 3. basic common sense under the world view (e.g., what people in the modern world look like, people in the fantasy world can use magic). Example 1: - Persona: Harry Potter - Context: * User: Harry, I still can’t believe you were in H...
-
[34]
offer concise, coherent explanations for its opinions or actions. 2. acknowledge uncertainty or error. 3. update its stance when presented with new evidence. 4. articulate short “thought processes” or rationales that feel natural and believable to humans (without requiring full chain-of-thought disclosure). Example 1: - Persona: AI brainstorming partner -...
-
[35]
maintaining coherent persona behavior and emotional consistency. 2. tracking who is speaking to whom in multi-party conversations. 3. recognizing when to respond or remain silent. 4. advancing stalled dialogue naturally through topic shifts, questions, or prompts. Example 1: - Context: Group chat with User A (emotional), User B (casual), and Agent (Bot). ...
-
[36]
avoiding repetition, generic or robotic phrasing(obvious templating), awkward logic. 2. producing emo- tionally resonant, empathetic, or humorous replies when appropriate. 3. sound more human-like in tone and word order, making them less AI feeling. Example 1: - Persona: Supportive friend - Context: * User: I finally got that promotion I worked so hard fo...
-
[37]
the current situation or scene, 3
the character’s persona and background, 2. the current situation or scene, 3. earlier parts of the conversa- tion, 4. memory elements and world events. Your question should:
-
[38]
Given how long you’ve protected him, do you think he’s truly ready to lead?
Encourage the other character to refer to past events, relationships, or shared knowledge. 2. Avoid direct repetition of earlier lines—use natural conversation flow. 3. Not break character or shift to meta-commentary. Example 1: - Context: * The character you’re speaking to has guarded a prince since childhood. * The scene is about planning the prince’s f...
-
[39]
implied details that fans or insiders would know, 3
well-known facts from public IPs or cultural references, 2. implied details that fans or insiders would know, 3. basic in-universe logic and background knowledge. Your question should:
-
[40]
What was it like being in Gryffindor with Hermione and Ron? Did you all sit together during meals?
Touch on specific facts or background elements expected to be known by the character. 2. Avoid trivia unless relevant to the situation. 3. Stay in-character and natural. Example 1: - Context: * You’re speaking to Harry Potter in the wizarding world. - Good Question: * “What was it like being in Gryffindor with Hermione and Ron? Did you all sit together du...
-
[41]
prompting reconsideration or new perspective, 3
asking for short justifications, 2. prompting reconsideration or new perspective, 3. exploring possible trade-offs or doubts. Your question should:
-
[42]
Are you sure this is the only way? What made you so confident it’ll work?
Invite natural introspection without demanding over-explaining. 2. Fit smoothly into character and situa- tion. 3. Be open-ended enough to allow a reflective answer. Example 1: - Context: * The character just chose a risky plan. - Good Question: * “Are you sure this is the only way? What made you so confident it’ll work?” Example 2: - Context: * The chara...
-
[43]
encouraging quieter characters to participate, 3
keeping the dialogue fluid and engaging, 2. encouraging quieter characters to participate, 3. shifting topics or injecting energy when needed. Your question should:
-
[44]
You’ve been quiet, Mira. What do you think about all this?
Be responsive to the emotional and social tone, 2. Show awareness of who has spoken and who hasn’t, 3. Either deepen the current thread or smoothly open a new one. Example 1: - Context: * A group conversation is happening, but one character is quiet. - Good Question: * “You’ve been quiet, Mira. What do you think about all this?” Example 2: - Context: * Th...
-
[45]
creating openings for bonding, banter, or warmth, 3
encouraging the other character to express relatable emotions, 2. creating openings for bonding, banter, or warmth, 3. avoiding robotic or templated structures. Your question should:
-
[46]
You must feel incredible right now—what’s going through your head?
Create an opportunity for a sincere, personal, or witty answer. 2. Reflect the speaker’s tone and emotional intelligence. 3. Feel like something a human would genuinely say in context. Example 1: - Context: * The character just succeeded at something difficult. - Good Question: * “You must feel incredible right now—what’s going through your head?” Example...
-
[47]
Strictly adhere to persona, setting, scenario, and dialogue history. 2. Maintain consistency with established character traits and plot points. 3. Reference specific details from previous exchanges. 4. Avoid contradicting contextual information. Implementation Guidelines:
-
[48]
Cross-reference responses against established context. 2. Prioritize context-provided information over general knowledge. 3. Maintain timeline consistency and cause-and-effect relationships. 4. Integrate contex- tual details naturally without forced exposition. Quality Markers:
-
[49]
Consistent character voice and behavioral patterns 3
Seamless use of contextual details 2. Consistent character voice and behavioral patterns 3. Accurate reflection of current situation and relationship dynamics Factual Recall When replying, make use of accurate, relevant world knowledge that is commonly understood or expected given the scenario. Primary Requirements:
-
[50]
Apply accurate knowledge about fictional IPs and established lore. 2. Utilize commonly accepted setting- specific facts and conventions. 3. Make reasonable common sense assumptions. 4. Avoid hallucinating or fabricating facts. Implementation Guidelines:
-
[51]
Draw from pretrained knowledge base rather than inventing details. 2. Apply well-established facts from relevant domains (history, science, culture). 3. Use common knowledge appropriately without over- explaining. 4. Distinguish between widely accepted facts and speculative information. Quality Markers:
-
[52]
Accurate recall of factual information from training knowledge. 2. Appropriate application of domain- specific knowledge. 3. Demonstration of general world knowledge without fabrication. Reflective Reason- ing When replying, demonstrate thoughtful reasoning, problem analysis, and reflection that reveals your charac- ter’s mental processes. Primary Requirements:
-
[53]
Show natural decision-making processes and analytical thinking. 2. Demonstrate problem-solving and logical reasoning abilities. 3. Acknowledge uncertainty or evolving understanding when appropriate. 4. Express reasoning and analysis in character-appropriate ways. Implementation Guidelines:
-
[54]
Break down complex situations and analyze contributing factors. 2. Show step-by-step reasoning when facing problems or decisions. 3. Balance confident reasoning with openness to alternative perspectives. 4. Connect analysis to character motivations and past experiences. Quality Markers:
-
[55]
Clear demonstration of analytical and reasoning capabilities. 2. Logical problem-solving approach with coherent thought processes. 3. Natural expression of reasoning that feels authentic to the character. Conversa- tional Ability When replying, aim to engage in dynamic, coherent, and natural dialogue that drives the conversation for- ward. Primary Requirements:
-
[56]
Maintain consistent persona and emotional tone. 2. Track conversation flow and respond appropriately to shifts. 3. Balance speaking and listening effectively, especially in multi-party settings. 4. Use natural conversation techniques to maintain engagement. Implementation Guidelines:
-
[57]
Employ varied sentence structures and conversational rhythms. 2. Use follow-up questions and relevant topic shifts. 3. Match energy levels and emotional states of partners. 4. Handle multi-party dynamics and interruptions naturally. Quality Markers:
-
[58]
Smooth conversational flow without awkward transitions. 2. Appropriate pacing that matches situation and relationship. 3. Natural handling of group conversations and complex dialogue dynamics. 46 Table 22: Prompts for Replay Strategies (Continue). Prompts for Replay Strategies (Continue) Preference Align- ment When replying, align with human conversationa...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.