pith. machine review for the scientific record.

arxiv: 2605.08129 · v1 · submitted 2026-05-01 · 💻 cs.LG

Recognition: 2 Lean theorem links

Towards Customized Multimodal Role-Play

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:05 UTC · model grok-4.3

classification 💻 cs.LG
keywords Customized Multimodal Role-Play · CMRP · RoleScape-20 · UniCharacter · Unified-SFT · Character-GRPO · few-shot customization · cross-modal consistency

The pith

A unified multimodal model learns a custom character's persona, style, and visual identity from only ten images plus example interactions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Customized Multimodal Role-Play task and builds the RoleScape-20 dataset of twenty characters to test whether a single model can keep a chosen persona, dialogue manner, and appearance consistent when it both writes text and generates images. It presents UniCharacter, a two-stage process of unified supervised fine-tuning followed by character-specific group relative policy optimization that achieves this coherence after training on roughly ten images and their paired interactions. A sympathetic reader would care because the result suggests AI companions or game agents could be quickly tailored to feel like specific, repeatable individuals rather than generic responders, without needing massive per-character datasets. Experiments show the method beats earlier approaches on the new dataset, and ablations confirm that the cross-modal consistency mechanisms and few-shot strategy each contribute.

Core claim

We introduce the Customized Multimodal Role-Play task together with the RoleScape-20 dataset that supplies training and evaluation data for persona, stylistic descriptions, visual cues, and text-image interactions across twenty characters. Building on a unified model, the UniCharacter framework applies Unified Supervised Finetuning followed by Character-GRPO; given only ten images plus corresponding interaction examples, the model acquires the target character and produces coherent persona, style, and visual identity in both generated text and images after approximately one hundred GPU hours of training, substantially outperforming prior approaches.

What carries the argument

UniCharacter, the two-stage training framework of Unified-SFT followed by Character-GRPO, which turns a small set of per-character images and dialogues into cross-modal output consistency.
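
The paper's code is not reproduced here, but the Figure 3 caption below pins down the Stage 1 objective: cross-entropy (CE) loss on text outputs and mean-squared-error (MSE) loss on image outputs. A minimal PyTorch sketch of such a Unified-SFT step follows; the model interface, the batch fields, and the image_loss_weight knob are hypothetical, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def unified_sft_step(model, batch, image_loss_weight: float = 1.0) -> torch.Tensor:
    """Stage 1 (Unified-SFT) sketch: CE on text tokens + MSE on image outputs,
    per the Figure 3 caption. All interfaces here are assumptions."""
    out = model(
        input_ids=batch["input_ids"],          # persona/dialogue context tokens
        image_latents=batch["image_latents"],  # reference-image latents
    )
    # Text branch: next-token cross-entropy over the character's response.
    text_loss = F.cross_entropy(
        out.text_logits.flatten(0, 1),   # (B*T, vocab)
        batch["target_ids"].flatten(),   # (B*T,)
        ignore_index=-100,               # mask positions outside the response
    )
    # Image branch: regression onto the target image latents.
    image_loss = F.mse_loss(out.image_pred, batch["target_latents"])
    return text_loss + image_loss_weight * image_loss
```

Stage 2 then swaps this likelihood objective for Character-GRPO's reward-driven update; a sketch of that objective appears after the figure list below.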

If this is right

  • The model acquires the target character and exhibits coherent persona, style, and visual identity in both generated text and images.
  • The approach substantially outperforms prior methods on the RoleScape-20 dataset.
  • Ablation studies confirm the effectiveness of the cross-modal consistency design and the few-shot customization strategy.
  • CMRP together with unified modeling supplies a basis for next-generation characterful and immersive interactive agents.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The low data requirement per character suggests the method could be applied to create many distinct agents without prohibitive compute.
  • The same few-shot consistency mechanism might extend to additional modalities such as audio or short video clips.
  • Personal users could potentially fine-tune their own private characters on consumer hardware if the 100-GPU-hour cost can be reduced further.

Load-bearing premise

That the two-stage training on a small per-character set will produce stable cross-modal consistency without overfitting or mode collapse.

What would settle it

A held-out test where the trained model is asked to continue a new conversation and generate an image; clear mismatch between the generated image or text and the character's established visual or personality cues would falsify the claim of acquired coherent identity.
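
As a concrete protocol, that test could look like the sketch below: generate a fresh reply and image for unseen prompts, score each modality against the character's established profile and reference images, and flag any clear mismatch. The scorer callables and the 0-to-1 threshold are assumptions; the paper itself evaluates with 1-7 judge rubrics and a user study.

```python
from typing import Callable, List, Tuple

def consistency_holdout(
    model,
    character,
    prompts: List[str],
    persona_score: Callable[[str], float],      # judge: reply vs. character profile, in [0, 1]
    identity_score: Callable[[object], float],  # judge: image vs. reference images, in [0, 1]
    threshold: float = 0.5,                     # hypothetical pass bar
) -> List[Tuple[str, float, float]]:
    """Flag held-out prompts where either modality breaks character.

    A non-empty return on clear mismatches would falsify the claim of
    acquired coherent identity; an empty return supports it."""
    failures = []
    for prompt in prompts:  # conversations the model has never seen
        reply = model.generate_text(character, prompt)
        image = model.generate_image(character, prompt, reply)
        t, v = persona_score(reply), identity_score(image)
        if min(t, v) < threshold:
            failures.append((prompt, t, v))
    return failures
```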

Figures

Figures reproduced from arXiv: 2605.08129 by Aixi Zhang, Chao Tang, Hao Jiang, Jiangning Zhang, Jianzong Wu, Qingyu Shi, Ye Tian, Yunhai Tong.

Figure 1: Demonstration of the UniCharacter model’s capabilities. The model utilizes a character’s profile to maintain consistency across several integrated tasks. The core innovation is showcased in Multimodal Role Play, where the model simultaneously generates a coherent textual response and a corresponding visual image that reflects the character’s emotion.
Figure 2: Data Construction Pipeline of the RoleScape-20 Dataset. The pipeline processes raw character materials (dialogues, images, profiles) into diverse training data, including multimodal role-play dialogues, T2I generation pairs, knowledge QA, and VQA pairs.
Figure 3: Overview of the UniCharacter framework. Stage 1 focuses on Unified-SFT, using MSE loss for image outputs and CE loss for text outputs. Stage 2 implements Character-GRPO, optimizing the policy π_θ via a multi-reward mechanism that considers both text-image alignment and generation diversity.
Figure 4: Qualitative comparison between UniCharacter and DreamBooth (Ruiz et al., 2022), UniCTokens (An et al., 2025), and Qwen2.5-VL (Bai et al., 2025) on various tasks and cases. In Role-Play cases, UniCharacter is superior: it shows Chandler’s personality through concise, sarcastic humor, while Qwen2.5-VL breaks character by being long-winded and explaining his feelings too much.
Figure 5: Qualitative results on two characters, Chandler and Joey from the dataset, demonstrating performance across multimodal role-play, text-to-image generation, Knowledge QA, and VQA.
Figure 6: Qualitative comparisons of the ablation studies on the model training stage (w/o GRPO and w/ GRPO). The accompanying text gives the Character-GRPO objective:

$$f(r,\hat{A},\theta,\epsilon,\beta) \;=\; \frac{1}{G}\sum_{i=1}^{G}\frac{1}{T}\sum_{t=0}^{T-1}\Big[\min\big(r_t^i(\theta)\,\hat{A}_t^i,\ \mathrm{clip}(r_t^i(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t^i\big) \;-\; \beta\,D_{\mathrm{KL}}(\pi_\theta\,\|\,\pi_{\mathrm{ref}})\Big]$$

where $r_t^i(\theta) = p_\theta(x_{t-1}^i \mid x_t^i, c)\,/\,p_{\theta_{\mathrm{old}}}(x_{t-1}^i \mid x_t^i, c)$ is the importance sampling ratio. The paper sets $\beta = 0$, meaning there is no KL-divergence penalty.
Figure 7: User study across tasks. Four methods (DreamBooth, Qwen2.5-VL, UniCTokens, and UniCharacter) were compared across three tasks: T2I generation, multimodal role-play, and text role-play. The width of all three bar charts is uniformly set to 100%; different colors represent different models, with percentages labeled on the corresponding bars. UniCharacter outperformed the baselines across all tasks.
Figure 8: An overview of RoleScape-20, which contains 9 human characters, 4 animal characters, and 7 anime characters.
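
The objective recovered alongside Figure 6 is a group-relative clipped surrogate in the PPO family. Below is a minimal NumPy sketch of that equation with β = 0, as the paper states (no KL penalty); the array shapes, the clip range ε = 0.2, and the within-group advantage normalization are assumptions the extract does not fix.

```python
import numpy as np

def character_grpo_objective(
    ratios: np.ndarray,      # r[i, t] = p_theta(x_{t-1} | x_t, c) / p_theta_old(...), shape (G, T)
    advantages: np.ndarray,  # group-relative advantages A_hat[i, t], shape (G, T)
    epsilon: float = 0.2,    # clip range; the paper's value is not stated
) -> float:
    """Clipped surrogate from the Figure 6 equation, with beta = 0 (no KL term)."""
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1.0 - epsilon, 1.0 + epsilon) * advantages
    per_step = np.minimum(unclipped, clipped)  # min(r*A_hat, clip(r)*A_hat)
    return float(per_step.mean())              # equals (1/G) sum_i (1/T) sum_t [...]

def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """One common GRPO choice: standardize rewards within the group of G samples.
    Whether UniCharacter uses exactly this normalization is not stated."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)
```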
read the original abstract

Unified multimodal understanding and generation models enable richer human-AI interaction. Yet jointly customizing a character's persona, dialogue style, and visual identity while maintaining output consistency across modalities remains largely unexplored. To mitigate this gap, we introduce a new task, Customized Multimodal Role-Play (CMRP). We construct the RoleScape-20 dataset comprising 20 characters, including training and evaluation data that cover persona, stylistic descriptions, visual/expressive cues, and text-image interactions. Building on a unified model, we devise UniCharacter, a two-stage training framework containing Unified Supervised Finetuning (Unified-SFT) and character-specific group relative policy optimization (Character-GRPO). Given only 10 images plus corresponding interaction examples, the model acquires the target character and exhibits coherent persona, style, and visual identity in both generated text and images. This process takes about 100 GPU hours. Experiments on the RoleScape-20 dataset show that the proposed method substantially outperforms prior approaches. Ablation studies further validate the effectiveness of our cross-modal consistency design and few-shot customization strategy. We argue that CMRP, coupled with unified modeling, provides a basis for next-generation characterful and immersive interactive agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces the Customized Multimodal Role-Play (CMRP) task along with the RoleScape-20 dataset containing 20 characters and associated training/evaluation data on persona, style, visual cues, and text-image interactions. It proposes the UniCharacter framework, a two-stage approach consisting of Unified Supervised Finetuning (Unified-SFT) followed by character-specific group relative policy optimization (Character-GRPO). The central claim is that, given only 10 images plus corresponding interaction examples, a unified multimodal model acquires the target character and produces coherent outputs in both text (persona and dialogue style) and images (visual identity), with the full process requiring approximately 100 GPU hours. Experiments on RoleScape-20 report substantial outperformance over prior approaches, supported by ablation studies on the cross-modal consistency design and few-shot strategy.

Significance. If the empirical results hold, this work is significant for filling a gap in jointly customizing multimodal character attributes while preserving cross-modal consistency, which is essential for immersive interactive agents. The introduction of a dedicated benchmark dataset and a practical few-shot customization method provides a concrete foundation for future research in personalized multimodal generation. Credit is due for the empirical focus on a newly constructed dataset and the adaptation of GRPO to character-specific optimization, which together enable verifiable claims about few-shot acquisition of coherent persona, style, and visual identity.

minor comments (3)
  1. [Abstract] The abstract states that the method 'substantially outperforms prior approaches' and that ablations 'validate the effectiveness,' but does not name the specific baselines or report any quantitative metrics (e.g., accuracy, consistency scores, or human evaluation results). Adding a summary table of key results with baseline names and effect sizes in the experiments section would strengthen the central empirical claim.
  2. [Dataset section (assumed §2 or §4)] The RoleScape-20 dataset is described at a high level (20 characters, covering persona, stylistic descriptions, visual/expressive cues, and interactions). A table or appendix detailing per-character splits (training vs. evaluation examples), annotation process for the 10-image few-shot sets, and how cross-modal consistency is measured would improve reproducibility and allow readers to assess whether the dataset sufficiently represents real-world customization needs.
  3. [Method section (assumed §3)] The Character-GRPO stage is introduced as a key component for character-specific optimization, yet the abstract provides no equations, pseudocode, or explicit differences from standard GRPO. Including the precise reward formulation or group-relative objective in the method section (with a numbered equation) would clarify how it enforces cross-modal consistency without mode collapse.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the detailed and positive assessment of our work on Customized Multimodal Role-Play. The summary accurately reflects the CMRP task, RoleScape-20 dataset, and UniCharacter framework. We are pleased that the referee recognizes the significance of jointly customizing multimodal character attributes with cross-modal consistency and recommends minor revision. No major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity; empirical training on constructed dataset

full rationale

The paper introduces a new task (CMRP) and dataset (RoleScape-20), then describes a two-stage empirical training procedure (Unified-SFT followed by Character-GRPO) evaluated on that dataset. No mathematical derivation, first-principles claim, or prediction is presented that reduces by construction to fitted parameters or self-citations. Ablations and results are reported as direct experimental outcomes on the held-out evaluation split. The central claim remains an empirical observation rather than a self-referential identity. This is the expected non-finding for a standard ML customization paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that a unified multimodal backbone plus the described two-stage procedure can enforce cross-modal consistency from limited per-character data. No new physical or mathematical axioms; the framework inherits standard supervised fine-tuning and RLHF-style optimization assumptions.

pith-pipeline@v0.9.0 · 5523 in / 1196 out tokens · 32407 ms · 2026-05-12T01:05:19.771635+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

