pith. machine review for the scientific record.

arxiv: 2604.13804 · v1 · submitted 2026-04-15 · 💻 cs.LG

Recognition: unknown

Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning


Pith reviewed 2026-05-10 14:16 UTC · model grok-4.3

classification 💻 cs.LG
keywords audio large language models · role-playing evaluation · character alignment · reinforcement learning · multimodal evaluation · paralinguistic information · speech dialogue systems · RoleChat dataset

The pith

Audio LLMs trained via reinforcement learning can judge how well speech matches a character's traits across multiple dimensions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the challenge of evaluating character consistency in voice role-playing systems, where paralinguistic features in speech are difficult to quantify objectively. It introduces RoleJudge as a framework that applies audio large language models to rate alignment between spoken output and defined character attributes in a structured, multidimensional way. The authors also release RoleChat, a new dataset of authentic and generated speech samples paired with chain-of-thought reasoning labels. Training proceeds through multiple stages that incorporate reinforcement learning with an added Standard Alignment step to reduce reward misalignment. Experiments show that the resulting RoleJudge model achieves higher accuracy and more favorable subjective ratings than baseline evaluators.

Core claim

RoleJudge is an evaluation framework that leverages audio large language models to systematically assess the alignment between speech and character across multiple modalities and dimensions. RoleChat is introduced as the first voice role-playing evaluation dataset enriched with chain-of-thought reasoning annotations. A multi-stage training paradigm that incorporates Standard Alignment in reinforcement learning mitigates reward misalignment, and experimental results demonstrate that RoleJudge outperforms various baseline models in both accuracy and subjective assessment.

What carries the argument

The RoleJudge framework: it fine-tunes audio large language models on the RoleChat dataset through multi-stage training and reinforcement learning with Standard Alignment, producing multidimensional scores for speech-character consistency.
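
To make "multidimensional scores" concrete, here is a minimal sketch of what one verdict from such a judge could look like. Only Logical Coherence is named among RoleChat's five dimensions in this excerpt; the other dimension names, the 1-5 scale, and all identifiers below are assumptions for illustration, not the authors' schema.

```python
from dataclasses import dataclass, field

# Only "Logical Coherence" is recoverable from Figure 1; the remaining four
# dimension names are hypothetical placeholders.
DIMENSIONS = ("logical_coherence", "dimension_2", "dimension_3",
              "dimension_4", "dimension_5")

@dataclass
class JudgeVerdict:
    """One RoleJudge-style assessment of a speech turn against a character profile."""
    reasoning: str  # chain-of-thought rationale, as in RoleChat annotations
    scores: dict[str, float] = field(default_factory=dict)  # per-dimension score, assumed 1-5

    def overall(self) -> float:
        # Simple mean across dimensions; the paper's aggregation rule is not given here.
        return sum(self.scores.values()) / len(self.scores)

verdict = JudgeVerdict(
    reasoning="Stays in character lexically, but the delivery is flat for an excitable persona.",
    scores={d: 4.0 for d in DIMENSIONS},
)
print(f"overall={verdict.overall():.1f}")  # -> overall=4.0
```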

If this is right

  • Role-playing speech systems can be evaluated more consistently without relying solely on human judges.
  • Developers gain a tool to measure and improve how well vocal features convey intended character traits.
  • Multimodal models can be assessed on both textual and acoustic dimensions of character consistency.
  • Training of audio LLMs for interactive dialogue benefits from reduced reward misalignment during optimization.
  • The RoleChat dataset supports further research on voice-based character simulation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar RL-based judge training could be applied to evaluate other paralinguistic attributes such as emotion or intent.
  • Automated judges might enable real-time feedback loops during character voice generation.
  • The approach could reduce the cost of large-scale benchmarking for emerging voice role-play applications.
  • If extended, the framework might support safety checks against inconsistent or misleading character portrayals.

Load-bearing premise

Audio large language models can reliably and without bias quantify paralinguistic cues to measure how well speech aligns with a character's defined attributes.

What would settle it

Collect independent human ratings of character alignment for a set of speech samples and check whether RoleJudge's automatic scores show large, systematic disagreement with the human consensus.
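
This check is directly scriptable once paired scores exist. A minimal sketch, assuming per-sample RoleJudge scores and the mean of independent human ratings on the same scale; the sample values are illustrative, and scipy.stats.spearmanr is a standard rank-correlation test:

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical paired scores for the same speech samples (illustrative values).
judge_scores = np.array([4.2, 3.1, 4.8, 2.0, 3.6])  # RoleJudge automatic scores
human_means = np.array([4.0, 3.4, 4.5, 2.3, 3.2])   # mean of independent human raters

rho, p = spearmanr(judge_scores, human_means)       # rank agreement with human consensus
bias = float(np.mean(judge_scores - human_means))   # systematic over- or under-scoring
mae = float(np.mean(np.abs(judge_scores - human_means)))

print(f"Spearman rho={rho:.2f} (p={p:.3f}), mean bias={bias:+.2f}, MAE={mae:.2f}")
# Low rho or a large |bias| would constitute the systematic disagreement described above.
```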

Figures

Figures reproduced from arXiv: 2604.13804 by Dongjie Fu, Fangming Feng, Linjun Li, Tao Jin, Xize Cheng, Zhou Zhao.

Figure 1: RoleChat encompasses five evaluation dimensions: Logical Coherence, which assesses the logical soundness of the … [image not reproduced]
Figure 2: The overall architecture of RoleJudge. It comprises initial model supervised fine-tuning and standard alignment … [image not reproduced]
Figure 3: Hyperparameter sensitivity analysis of the Standard Alignment mechanism on the validation set. [image not reproduced]
original abstract

The rapid evolution of multimodal large models has revolutionized the simulation of diverse characters in speech dialogue systems, enabling a novel interactive paradigm. Character attributes are manifested not only in textual responses but also through vocal features, as speech conveys rich paralinguistic information that is challenging to quantify. This poses significant difficulties in evaluating the character alignment of role-playing agents. To address these challenges, we present RoleJudge, an evaluation framework that leverages audio large language models to systematically assess the alignment between speech and character across multiple modalities and dimensions. Furthermore, we introduce RoleChat, the first voice role-playing evaluation dataset enriched with chain-of-thought reasoning annotations, comprising a diverse set of authentic and LLM-generated speech samples. Utilizing this dataset, we implement a multi-stage training paradigm and incorporate Standard Alignment in reinforcement learning to mitigate reward misalignment during optimization. Experimental results in terms of accuracy and subjective assessment demonstrate that RoleJudge outperforms various baseline models, validating the effectiveness of our multidimensional evaluation framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents RoleJudge, an evaluation framework that uses audio large language models to assess alignment between speech and character attributes in role-playing agents across multiple modalities and dimensions. It introduces RoleChat, the first voice role-playing evaluation dataset with chain-of-thought reasoning annotations, comprising authentic and LLM-generated speech samples. The authors describe a multi-stage training paradigm and the incorporation of Standard Alignment in reinforcement learning to mitigate reward misalignment. The experimental results show that RoleJudge outperforms various baseline models in accuracy and subjective assessment, validating the effectiveness of the multidimensional evaluation framework.

Significance. If the results hold, this work could be significant for advancing evaluation methods in multimodal AI and role-playing speech systems by tackling the quantification of paralinguistic features for character consistency. The creation of the RoleChat dataset and the multi-stage RL training approach with Standard Alignment are clear strengths that could support future research in audio LLMs, provided the core assumptions are rigorously tested.

major comments (2)
  1. [Abstract and Experimental Results] The central claim that RoleJudge validates the multidimensional framework depends on audio LLMs reliably and unbiasedly quantifying paralinguistic information (tone, prosody, emotion) to assess character alignment. The experiments provide no ablations, controls, or external validation showing that the multi-stage training and Standard Alignment RL remove rather than amplify model-specific biases in paralinguistic interpretation. This assumption is load-bearing for interpreting outperformance as framework validation.
  2. [Abstract] The assertion of outperformance on accuracy and subjective assessment supplies no specific metrics, baselines, error bars, or data details, which prevents direct assessment of whether the results support the claims.
minor comments (2)
  1. The abstract would be strengthened by briefly noting key quantitative improvements (e.g., accuracy deltas) even if full tables appear later.
  2. [Method] Clarify the exact definition and implementation of 'Standard Alignment' in the RL stage to avoid ambiguity in the method description (one hypothetical reading is sketched below).
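
Since the excerpted text never defines 'Standard Alignment', the following is one plausible reading only, stated as an assumption rather than the authors' method: a shaping term that anchors the judge's predicted score to the dataset's annotated "standard" score during RL, so the policy cannot inflate reward by drifting the scoring scale. The function name, the penalty form, and the weight beta are all hypothetical; Figure 3's hyperparameter sensitivity analysis is at least consistent with a tunable weight of this kind.

```python
def standard_aligned_reward(task_reward: float, pred_score: float,
                            standard_score: float, beta: float = 0.3) -> float:
    """Hypothetical 'Standard Alignment' shaping (assumed form, not the paper's):
    blend the RL task reward with a penalty for deviating from the annotated
    standard score, discouraging reward hacking via scale drift."""
    return task_reward - beta * abs(pred_score - standard_score)

# beta = 0 recovers the plain RL reward; larger beta ties the judge more
# tightly to the annotation scale at the cost of flexibility.
```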

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive feedback. We have carefully considered the major comments and provide point-by-point responses below, along with our plans for revision.

point-by-point responses
  1. Referee: [Abstract and Experimental Results] The central claim that RoleJudge validates the multidimensional framework depends on audio LLMs reliably and unbiasedly quantifying paralinguistic information (tone, prosody, emotion) to assess character alignment. The experiments provide no ablations, controls, or external validation showing that the multi-stage training and Standard Alignment RL remove rather than amplify model-specific biases in paralinguistic interpretation. This assumption is load-bearing for interpreting outperformance as framework validation.

    Authors: We agree that demonstrating the effectiveness of the multi-stage training and Standard Alignment in mitigating biases is crucial for validating our claims. The current manuscript includes comparative results showing improved performance with these techniques, but we acknowledge the lack of explicit ablations for bias analysis. In the revised manuscript, we will add dedicated ablation studies and controls, including comparisons of paralinguistic feature extraction with and without Standard Alignment, as well as correlation analyses with human judgments to provide external validation. This will address the concern that the training might amplify biases. revision: yes

  2. Referee: [Abstract] The assertion of outperformance on accuracy and subjective assessment supplies no specific metrics, baselines, error bars, or data details, which prevents direct assessment of whether the results support the claims.

    Authors: We appreciate this observation regarding the abstract. While the full experimental section provides detailed metrics, baselines, and statistical information, the abstract was kept concise. We will revise the abstract to include specific performance metrics (such as accuracy scores and improvements over baselines), list the main baselines, and reference the presence of error bars and subjective assessment details to better support the claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical framework with new dataset and RL training

full rationale

The paper introduces RoleJudge as a new evaluation framework for audio LLMs in role-playing, creates the RoleChat dataset with chain-of-thought annotations, and applies multi-stage training plus Standard Alignment RL. Claims of outperformance are supported by direct experimental accuracy and subjective metrics on this dataset, without any derivation step that reduces by construction to fitted parameters, self-definitions, or self-citation chains. The central validation rests on external comparisons to baselines rather than tautological inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Central claim depends on the new framework and dataset being effective, with domain assumptions about LLM evaluation capabilities and no explicit free parameters or invented scientific entities beyond the named tools.

axioms (1)
  • domain assumption: Audio large language models can systematically assess alignment between speech and character across multiple modalities and dimensions
    Directly invoked when introducing RoleJudge to address quantification difficulties of paralinguistic information.
invented entities (2)
  • RoleJudge (no independent evidence)
    purpose: Multimodal evaluation framework for character alignment in speech role-playing
    Newly introduced system whose effectiveness is claimed but lacks independent external validation in the abstract.
  • RoleChat (no independent evidence)
    purpose: Voice role-playing evaluation dataset with chain-of-thought annotations
    Newly constructed dataset mixing authentic and LLM-generated samples, presented as the first of its kind.

pith-pipeline@v0.9.0 · 5477 in / 1406 out tokens · 52946 ms · 2026-05-10T14:16:26.388990+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

46 extracted references · 39 canonical work pages · 11 internal anchors

  [1] Jaewoo Ahn, Taehyun Lee, Junyoung Lim, Jin-Hwa Kim, Sangdoo Yun, Hwaran Lee, and Gunhee Kim. 2024. TimeChara: Evaluating point-in-time character hallucination of role-playing large language models. arXiv preprint arXiv:2405.18027 (2024).
  [2] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609 (2023).
  [3] Chen Chen, Yuchen Hu, Siyin Wang, Helin Wang, Zhehuai Chen, Chao Zhang, Chao-Han Huck Yang, and Eng Siong Chng. 2025. Audio large language models can be descriptive speech quality evaluators. arXiv preprint arXiv:2501.17202 (2025).
  [4] Hongzhan Chen, Hehong Chen, Ming Yan, Wenshen Xu, Xing Gao, Weizhou Shen, Xiaojun Quan, Chenliang Li, Ji Zhang, Fei Huang, et al. 2024. SocialBench: Sociality evaluation of role-playing conversational agents. arXiv preprint arXiv:2403.13679 (2024).
  [5] Junjie Chen, Yao Hu, Junjie Li, Kangyue Li, Kun Liu, Wenpeng Li, Xu Li, Ziyuan Li, Feiyu Shen, Xu Tang, et al. 2025. FireRedChat: A Pluggable, Full-Duplex Voice Interaction System with Cascaded and Semi-Cascaded Implementations. arXiv preprint arXiv:2509.06502 (2025).
  [6] Wenxi Chen, Ziyang Ma, Ruiqi Yan, Yuzhe Liang, Xiquan Li, Ruiyang Xu, Zhikang Niu, Yanqiao Zhu, Yifan Yang, Zhanxun Liu, et al. 2024. SLAM-Omni: Timbre-controllable voice interaction system with single-stage training. arXiv preprint arXiv:2412.15649 (2024).
  [7] Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, et al. 2024. Qwen2-Audio technical report. arXiv preprint arXiv:2407.10759 (2024).
  [8] Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. 2023. Qwen-Audio: Advancing universal audio understanding via unified large-scale audio-language models. arXiv preprint arXiv:2311.07919 (2023).
  [9] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025).
  [10] Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, et al. 2024. CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens. arXiv preprint arXiv:2407.05407 (2024).
  [11] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024).
  [12] Qingkai Fang, Yan Zhou, Shoutao Guo, Shaolei Zhang, and Yang Feng. 2025. LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis. arXiv preprint arXiv:2505.02625 (2025).
  [13] Qiming Feng, Qiujie Xie, Xiaolong Wang, Qingqiu Li, Yuejie Zhang, Rui Feng, Tao Zhang, and Shang Gao. 2025. EmoCharacter: Evaluating the Emotional Fidelity of Role-Playing Agents in Dialogues. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: ...
  [14] Sreyan Ghosh, Zhifeng Kong, Sonal Kumar, S Sakshi, Jaehyeon Kim, Wei Ping, Rafael Valle, Dinesh Manocha, and Bryan Catanzaro. 2025. Audio Flamingo 2: An audio-language model with long-audio understanding and expert reasoning abilities. arXiv preprint arXiv:2503.03983 (2025).
  [15] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025).
  [16] Shengpeng Ji, Tianle Liang, Yangzhuo Li, Jialong Zuo, Minghui Fang, Jinzheng He, Yifu Chen, Zhengqing Liu, Ziyue Jiang, Xize Cheng, et al. 2025. WavReward: Spoken Dialogue Models With Generalist Reward Evaluators. arXiv preprint arXiv:2505.09558 (2025).
  [17] Zhifeng Kong, Arushi Goel, Rohan Badlani, Wei Ping, Rafael Valle, and Bryan Catanzaro. 2024. Audio Flamingo: A novel audio language model with few-shot learning and dialogue abilities. arXiv preprint arXiv:2402.01831 (2024).
  [18] Cheng Li, Ziang Leng, Chenxi Yan, Junyi Shen, Hao Wang, Weishi Mi, Yaying Fei, Xiaoyang Feng, Song Yan, HaoSheng Wang, et al. 2023. ChatHaruhi: Reviving anime character in reality via large language model. arXiv preprint arXiv:2308.09597 (2023).
  [19] Guan-Ting Lin, Cheng-Han Chiang, and Hung-yi Lee. 2024. Advancing large language models to capture varied speaking styles and respond properly in spoken conversations. arXiv preprint arXiv:2402.12786 (2024).
  [20] Zuwei Long, Yunhang Shen, Chaoyou Fu, Heting Gao, Lijiang Li, Peixian Chen, Mengdan Zhang, Hang Shao, Jian Li, Jinlong Peng, Haoyu Cao, Ke Li, Rongrong Ji, and Xing Sun. 2025. VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model. arXiv preprint arXiv:2505.03739 (2025). https://arxiv.org/abs/2505.03739
  [21] Keming Lu, Bowen Yu, Chang Zhou, and Jingren Zhou. 2024. Large language models are superpositions of all characters: Attaining arbitrary role-play via self-alignment. arXiv preprint arXiv:2401.12474 (2024).
  [22] OpenAI. 2024. GPT-4o System Card. https://cdn.openai.com/gpt-4o-system-card.pdf (2024).
  [23] OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner...
  [24] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems 36 (2023), 53728–53741.
  [25–26] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017).
  [27] Murray Shanahan, Kyle McDonell, and Laria Reynolds. 2023. Role play with large language models. Nature 623, 7987 (2023), 493–498.
  [28] Yunfan Shao, Linyang Li, Junqi Dai, and Xipeng Qiu. 2023. Character-LLM: A Trainable Agent for Role-Playing. arXiv preprint arXiv:2310.10158 (2023). https://arxiv.org/abs/2310.10158
  [29] Tianhao Shen, Sun Li, Quan Tu, and Deyi Xiong. 2023. RoleEval: A bilingual role evaluation benchmark for large language models. arXiv preprint arXiv:2312.16132 (2023).
  [30] Tongyi SpeechTeam. 2024. FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs. arXiv preprint arXiv:2407.04051 (2024).
  [31] Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang. 2023. SALMONN: Towards generic hearing abilities for large language models. arXiv preprint arXiv:2310.13289 (2023).
  [32] Quan Tu, Shilong Fan, Zihang Tian, and Rui Yan. 2024. CharacterEval: A Chinese benchmark for role-playing conversational agent evaluation. arXiv preprint arXiv:2401.01275 (2024).
  [33] Xintao Wang, Yunze Xiao, Jen-tse Huang, Siyu Yuan, Rui Xu, Haoran Guo, Quan Tu, Yaying Fei, Ziang Leng, Wei Wang, et al. 2023. InCharacter: Evaluating personality fidelity in role-playing agents through psychological interviews. arXiv preprint arXiv:2310.17976 (2023).
  [34] Zekun Moore Wang, Zhongyuan Peng, Haoran Que, Jiaheng Liu, Wangchunshu Zhou, Yuhan Wu, Hongcheng Guo, Ruitong Gan, Zehao Ni, Jian Yang, et al. 2023. RoleLLM: Benchmarking, eliciting, and enhancing role-playing abilities of large language models. arXiv preprint arXiv:2310.00746 (2023).
  [35] Weihao Wu, Liang Cao, Xinyu Wu, Zhiwei Lin, Rui Niu, Jingbei Li, and Zhiyong Wu. 2025. VoxRole: A Comprehensive Benchmark for Evaluating Speech-Based Role-Playing Agents. arXiv preprint arXiv:2509.03940 (2025). https://arxiv.org/abs/2509.03940
  [36] Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. 2025. Qwen2.5-Omni technical report. arXiv preprint arXiv:2503.20215 (2025).
  [37] Rui Xu, Dakuan Lu, Xiaoyu Tan, Xintao Wang, Siyu Yuan, Jiangjie Chen, Wei Chu, and Yinghui Xu. 2024. MindEcho: Role-playing language agents for key opinion leaders. arXiv preprint arXiv:2407.05305 (2024).
  [38] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025).
  [39] Chao Yu, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen, and Yi Wu. 2022. The surprising effectiveness of PPO in cooperative multi-agent games. Advances in Neural Information Processing Systems 35 (2022), 24611–24624.
  [40] Aohan Zeng, Zhengxiao Du, Mingdao Liu, Kedong Wang, Shengmin Jiang, Lei Zhao, Yuxiao Dong, and Jie Tang. 2024. GLM-4-Voice: Towards intelligent and human-like end-to-end spoken chatbot. arXiv preprint arXiv:2412.02612 (2024).
  [41] Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. 2023. SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities. arXiv preprint arXiv:2305.11000 (2023).
  [42] Haonan Zhang, Run Luo, Xiong Liu, Yuchuan Wu, Ting-En Lin, Pengpeng Zeng, Qiang Qu, Feiteng Fang, Min Yang, Lianli Gao, et al. 2025. OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction. arXiv preprint arXiv:2505.20277 (2025).
  [43] Pinyi Zhang, Siyu An, Lingfeng Qiao, Yifei Yu, Jingyang Chen, Jie Wang, Di Yin, Xing Sun, and Kai Zhang. 2025. RolePlot: A Systematic Framework for Evaluating and Enhancing the Plot-Progression Capabilities of Role-Playing Agents. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Wanxiang C...
  [44] Pan Zhang, Xiaoyi Dong, Yuhang Cao, Yuhang Zang, Rui Qian, Xilin Wei, Lin Chen, Yifei Li, Junbo Niu, Shuangrui Ding, Qipeng Guo, Haodong Duan, Xin Chen, Han Lv, Zheng Nie, Min Zhang, Bin Wang, Wenwei Zhang, Xinyue Zhang, Jiaye Ge, Wei Li, Jingwen Li, Zhongying Tu, Conghui He, Xingcheng Zhang, Kai Chen, Yu Qiao, Dahua Lin, and Jiaqi Wang. 2024. InternLM-XC...
  [45] Pan Zhang, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Rui Qian, Lin Chen, Qipeng Guo, Haodong Duan, Bin Wang, Linke Ouyang, et al. 2024. InternLM-XComposer-2.5: A versatile large vision language model supporting long-contextual input and output. arXiv preprint arXiv:2407.03320 (2024).
  [46] Jinfeng Zhou, Zhuang Chen, Dazhen Wan, Bosi Wen, Yi Song, Jifan Yu, Yongkang Huang, Pei Ke, Guanqun Bi, Libiao Peng, et al. 2024. CharacterGLM: Customizing Social Characters with Large Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, 1457–1476.