pith. machine review for the scientific record.

arxiv: 2604.13804 · v1 · submitted 2026-04-15 · 💻 cs.LG

Recognition: unknown

Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning


Pith reviewed 2026-05-10 14:16 UTC · model grok-4.3

classification 💻 cs.LG
keywords audio large language models · role-playing evaluation · character alignment · reinforcement learning · multimodal evaluation · paralinguistic information · speech dialogue systems · RoleChat dataset

The pith

Audio LLMs trained via reinforcement learning can judge how well speech matches a character's traits across multiple dimensions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the challenge of evaluating character consistency in voice role-playing systems, where paralinguistic features in speech are difficult to quantify objectively. It introduces RoleJudge as a framework that applies audio large language models to rate alignment between spoken output and defined character attributes in a structured, multidimensional way. The authors also release RoleChat, a new dataset of authentic and generated speech samples paired with chain-of-thought reasoning labels. Training proceeds through multiple stages that incorporate reinforcement learning with an added Standard Alignment step to reduce reward misalignment. Experiments show that the resulting RoleJudge model achieves higher accuracy and more favorable subjective ratings than baseline evaluators.

Core claim

RoleJudge is an evaluation framework that leverages audio large language models to systematically assess the alignment between speech and character across multiple modalities and dimensions. RoleChat is introduced as the first voice role-playing evaluation dataset enriched with chain-of-thought reasoning annotations. A multi-stage training paradigm that incorporates Standard Alignment in reinforcement learning mitigates reward misalignment, and experimental results demonstrate that RoleJudge outperforms various baseline models in both accuracy and subjective assessment.

What carries the argument

The RoleJudge framework: it fine-tunes audio large language models on the RoleChat dataset through multi-stage training and reinforcement learning with Standard Alignment, producing multidimensional scores for speech-character consistency.
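
To make "multidimensional scores" concrete, here is a minimal sketch of what one verdict from such a judge could look like. Only Logical Coherence is named among RoleChat's five dimensions in this excerpt; the other dimension names, the 1-5 scale, and all identifiers below are assumptions for illustration, not the authors' schema.

```python
from dataclasses import dataclass, field

# Only "Logical Coherence" is recoverable from Figure 1; the remaining four
# dimension names are hypothetical placeholders.
DIMENSIONS = ("logical_coherence", "dimension_2", "dimension_3",
              "dimension_4", "dimension_5")

@dataclass
class JudgeVerdict:
    """One RoleJudge-style assessment of a speech turn against a character profile."""
    reasoning: str  # chain-of-thought rationale, as in RoleChat annotations
    scores: dict[str, float] = field(default_factory=dict)  # per-dimension score, assumed 1-5

    def overall(self) -> float:
        # Simple mean across dimensions; the paper's aggregation rule is not given here.
        return sum(self.scores.values()) / len(self.scores)

verdict = JudgeVerdict(
    reasoning="Stays in character lexically, but the delivery is flat for an excitable persona.",
    scores={d: 4.0 for d in DIMENSIONS},
)
print(f"overall={verdict.overall():.1f}")  # -> overall=4.0
```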

If this is right

  • Role-playing speech systems can be evaluated more consistently without relying solely on human judges.
  • Developers gain a tool to measure and improve how well vocal features convey intended character traits.
  • Multimodal models can be assessed on both textual and acoustic dimensions of character consistency.
  • Training of audio LLMs for interactive dialogue benefits from reduced reward misalignment during optimization.
  • The RoleChat dataset supports further research on voice-based character simulation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar RL-based judge training could be applied to evaluate other paralinguistic attributes such as emotion or intent.
  • Automated judges might enable real-time feedback loops during character voice generation.
  • The approach could reduce the cost of large-scale benchmarking for emerging voice role-play applications.
  • If extended, the framework might support safety checks against inconsistent or misleading character portrayals.

Load-bearing premise

Audio large language models can reliably and without bias quantify paralinguistic cues to measure how well speech aligns with a character's defined attributes.

What would settle it

Collect independent human ratings of character alignment for a set of speech samples and check whether RoleJudge's automatic scores show large, systematic disagreement with the human consensus.
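
This check is directly scriptable once paired scores exist. A minimal sketch, assuming per-sample RoleJudge scores and the mean of independent human ratings on the same scale; the sample values are illustrative, and scipy.stats.spearmanr is a standard rank-correlation test:

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical paired scores for the same speech samples (illustrative values).
judge_scores = np.array([4.2, 3.1, 4.8, 2.0, 3.6])  # RoleJudge automatic scores
human_means = np.array([4.0, 3.4, 4.5, 2.3, 3.2])   # mean of independent human raters

rho, p = spearmanr(judge_scores, human_means)       # rank agreement with human consensus
bias = float(np.mean(judge_scores - human_means))   # systematic over- or under-scoring
mae = float(np.mean(np.abs(judge_scores - human_means)))

print(f"Spearman rho={rho:.2f} (p={p:.3f}), mean bias={bias:+.2f}, MAE={mae:.2f}")
# Low rho or a large |bias| would constitute the systematic disagreement described above.
```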

Figures

Figures reproduced from arXiv: 2604.13804 by Dongjie Fu, Fangming Feng, Linjun Li, Tao Jin, Xize Cheng, Zhou Zhao.

Figure 1: RoleChat encompasses five evaluation dimensions: Logical Coherence, which assesses the logical soundness of the … [image not reproduced]
Figure 2: The overall architecture of RoleJudge. It comprises initial model supervised fine-tuning and standard alignment … [image not reproduced]
Figure 3: Hyperparameter sensitivity analysis of the Standard Alignment mechanism on the validation set. [image not reproduced]
original abstract

The rapid evolution of multimodal large models has revolutionized the simulation of diverse characters in speech dialogue systems, enabling a novel interactive paradigm. Character attributes are manifested not only in textual responses but also through vocal features, as speech conveys rich paralinguistic information that is challenging to quantify. This poses significant difficulties in evaluating the character alignment of role-playing agents. To address these challenges, we present RoleJudge, an evaluation framework that leverages audio large language models to systematically assess the alignment between speech and character across multiple modalities and dimensions. Furthermore, we introduce RoleChat, the first voice role-playing evaluation dataset enriched with chain-of-thought reasoning annotations, comprising a diverse set of authentic and LLM-generated speech samples. Utilizing this dataset, we implement a multi-stage training paradigm and incorporate Standard Alignment in reinforcement learning to mitigate reward misalignment during optimization. Experimental results in terms of accuracy and subjective assessment demonstrate that RoleJudge outperforms various baseline models, validating the effectiveness of our multidimensional evaluation framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents RoleJudge, an evaluation framework that uses audio large language models to assess alignment between speech and character attributes in role-playing agents across multiple modalities and dimensions. It introduces RoleChat, the first voice role-playing evaluation dataset with chain-of-thought reasoning annotations, comprising authentic and LLM-generated speech samples. The authors describe a multi-stage training paradigm and the incorporation of Standard Alignment in reinforcement learning to mitigate reward misalignment. The experimental results show that RoleJudge outperforms various baseline models in accuracy and subjective assessment, validating the effectiveness of the multidimensional evaluation framework.

Significance. If the results hold, this work could be significant for advancing evaluation methods in multimodal AI and role-playing speech systems by tackling the quantification of paralinguistic features for character consistency. The creation of the RoleChat dataset and the multi-stage RL training approach with Standard Alignment are clear strengths that could support future research in audio LLMs, provided the core assumptions are rigorously tested.

major comments (2)
  1. [Abstract and Experimental Results] The central claim that RoleJudge validates the multidimensional framework depends on audio LLMs reliably and unbiasedly quantifying paralinguistic information (tone, prosody, emotion) to assess character alignment. The experiments provide no ablations, controls, or external validation showing that the multi-stage training and Standard Alignment RL remove rather than amplify model-specific biases in paralinguistic interpretation. This assumption is load-bearing for interpreting outperformance as framework validation.
  2. [Abstract] The assertion of outperformance on accuracy and subjective assessment supplies no specific metrics, baselines, error bars, or data details, which prevents direct assessment of whether the results support the claims.
minor comments (2)
  1. The abstract would be strengthened by briefly noting key quantitative improvements (e.g., accuracy deltas) even if full tables appear later.
  2. [Method] Clarify the exact definition and implementation of 'Standard Alignment' in the RL stage to avoid ambiguity in the method description (one hypothetical reading is sketched below).
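
Since the excerpted text never defines 'Standard Alignment', the following is one plausible reading only, stated as an assumption rather than the authors' method: a shaping term that anchors the judge's predicted score to the dataset's annotated "standard" score during RL, so the policy cannot inflate reward by drifting the scoring scale. The function name, the penalty form, and the weight beta are all hypothetical; Figure 3's hyperparameter sensitivity analysis is at least consistent with a tunable weight of this kind.

```python
def standard_aligned_reward(task_reward: float, pred_score: float,
                            standard_score: float, beta: float = 0.3) -> float:
    """Hypothetical 'Standard Alignment' shaping (assumed form, not the paper's):
    blend the RL task reward with a penalty for deviating from the annotated
    standard score, discouraging reward hacking via scale drift."""
    return task_reward - beta * abs(pred_score - standard_score)

# beta = 0 recovers the plain RL reward; larger beta ties the judge more
# tightly to the annotation scale at the cost of flexibility.
```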

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive feedback. We have carefully considered the major comments and provide point-by-point responses below, along with our plans for revision.

point-by-point responses
  1. Referee: [Abstract and Experimental Results] The central claim that RoleJudge validates the multidimensional framework depends on audio LLMs reliably and unbiasedly quantifying paralinguistic information (tone, prosody, emotion) to assess character alignment. The experiments provide no ablations, controls, or external validation showing that the multi-stage training and Standard Alignment RL remove rather than amplify model-specific biases in paralinguistic interpretation. This assumption is load-bearing for interpreting outperformance as framework validation.

    Authors: We agree that demonstrating the effectiveness of the multi-stage training and Standard Alignment in mitigating biases is crucial for validating our claims. The current manuscript includes comparative results showing improved performance with these techniques, but we acknowledge the lack of explicit ablations for bias analysis. In the revised manuscript, we will add dedicated ablation studies and controls, including comparisons of paralinguistic feature extraction with and without Standard Alignment, as well as correlation analyses with human judgments to provide external validation. This will address the concern that the training might amplify biases. revision: yes

  2. Referee: [Abstract] The assertion of outperformance on accuracy and subjective assessment supplies no specific metrics, baselines, error bars, or data details, which prevents direct assessment of whether the results support the claims.

    Authors: We appreciate this observation regarding the abstract. While the full experimental section provides detailed metrics, baselines, and statistical information, the abstract was kept concise. We will revise the abstract to include specific performance metrics (such as accuracy scores and improvements over baselines), list the main baselines, and reference the presence of error bars and subjective assessment details to better support the claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical framework with new dataset and RL training

full rationale

The paper introduces RoleJudge as a new evaluation framework for audio LLMs in role-playing, creates the RoleChat dataset with chain-of-thought annotations, and applies multi-stage training plus Standard Alignment RL. Claims of outperformance are supported by direct experimental accuracy and subjective metrics on this dataset, without any derivation step that reduces by construction to fitted parameters, self-definitions, or self-citation chains. The central validation rests on external comparisons to baselines rather than tautological inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Central claim depends on the new framework and dataset being effective, with domain assumptions about LLM evaluation capabilities and no explicit free parameters or invented scientific entities beyond the named tools.

axioms (1)
  • domain assumption: Audio large language models can systematically assess alignment between speech and character across multiple modalities and dimensions
    Directly invoked when introducing RoleJudge to address quantification difficulties of paralinguistic information.
invented entities (2)
  • RoleJudge (no independent evidence)
    purpose: Multimodal evaluation framework for character alignment in speech role-playing
    Newly introduced system whose effectiveness is claimed but lacks independent external validation in the abstract.
  • RoleChat (no independent evidence)
    purpose: Voice role-playing evaluation dataset with chain-of-thought annotations
    Newly constructed dataset mixing authentic and LLM-generated samples, presented as the first of its kind.

pith-pipeline@v0.9.0 · 5477 in / 1406 out tokens · 52946 ms · 2026-05-10T14:16:26.388990+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

46 extracted references · 39 canonical work pages · 11 internal anchors

  [1] Jaewoo Ahn, Taehyun Lee, Junyoung Lim, Jin-Hwa Kim, Sangdoo Yun, Hwaran Lee, and Gunhee Kim. 2024. TimeChara: Evaluating point-in-time character hallucination of role-playing large language models. arXiv preprint arXiv:2405.18027 (2024).
  [2] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609 (2023).
  [3] Chen Chen, Yuchen Hu, Siyin Wang, Helin Wang, Zhehuai Chen, Chao Zhang, Chao-Han Huck Yang, and Eng Siong Chng. 2025. Audio large language models can be descriptive speech quality evaluators. arXiv preprint arXiv:2501.17202 (2025).
  [4] Hongzhan Chen, Hehong Chen, Ming Yan, Wenshen Xu, Xing Gao, Weizhou Shen, Xiaojun Quan, Chenliang Li, Ji Zhang, Fei Huang, et al. 2024. SocialBench: Sociality evaluation of role-playing conversational agents. arXiv preprint arXiv:2403.13679 (2024).
  [5] Junjie Chen, Yao Hu, Junjie Li, Kangyue Li, Kun Liu, Wenpeng Li, Xu Li, Ziyuan Li, Feiyu Shen, Xu Tang, et al. 2025. FireRedChat: A Pluggable, Full-Duplex Voice Interaction System with Cascaded and Semi-Cascaded Implementations. arXiv preprint arXiv:2509.06502 (2025).
  [6] Wenxi Chen, Ziyang Ma, Ruiqi Yan, Yuzhe Liang, Xiquan Li, Ruiyang Xu, Zhikang Niu, Yanqiao Zhu, Yifan Yang, Zhanxun Liu, et al. 2024. SLAM-Omni: Timbre-controllable voice interaction system with single-stage training. arXiv preprint arXiv:2412.15649 (2024).
  [7] Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, et al. 2024. Qwen2-Audio technical report. arXiv preprint arXiv:2407.10759 (2024).
  [8] Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. 2023. Qwen-Audio: Advancing universal audio understanding via unified large-scale audio-language models. arXiv preprint arXiv:2311.07919 (2023).
  [9] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025).
  [10] Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, et al. 2024. CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens. arXiv preprint arXiv:2407.05407 (2024).
  [11] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024).
  [12] Qingkai Fang, Yan Zhou, Shoutao Guo, Shaolei Zhang, and Yang Feng. 2025. LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis. arXiv preprint arXiv:2505.02625 (2025).
  [13] Qiming Feng, Qiujie Xie, Xiaolong Wang, Qingqiu Li, Yuejie Zhang, Rui Feng, Tao Zhang, and Shang Gao. 2025. EmoCharacter: Evaluating the Emotional Fidelity of Role-Playing Agents in Dialogues. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: ...
  [14] Sreyan Ghosh, Zhifeng Kong, Sonal Kumar, S Sakshi, Jaehyeon Kim, Wei Ping, Rafael Valle, Dinesh Manocha, and Bryan Catanzaro. 2025. Audio Flamingo 2: An audio-language model with long-audio understanding and expert reasoning abilities. arXiv preprint arXiv:2503.03983 (2025).
  [15] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025).
  [16] Shengpeng Ji, Tianle Liang, Yangzhuo Li, Jialong Zuo, Minghui Fang, Jinzheng He, Yifu Chen, Zhengqing Liu, Ziyue Jiang, Xize Cheng, et al. 2025. WavReward: Spoken Dialogue Models With Generalist Reward Evaluators. arXiv preprint arXiv:2505.09558 (2025).
  [17] Zhifeng Kong, Arushi Goel, Rohan Badlani, Wei Ping, Rafael Valle, and Bryan Catanzaro. 2024. Audio Flamingo: A novel audio language model with few-shot learning and dialogue abilities. arXiv preprint arXiv:2402.01831 (2024).
  [18] Cheng Li, Ziang Leng, Chenxi Yan, Junyi Shen, Hao Wang, Weishi Mi, Yaying Fei, Xiaoyang Feng, Song Yan, HaoSheng Wang, et al. 2023. ChatHaruhi: Reviving anime character in reality via large language model. arXiv preprint arXiv:2308.09597 (2023).
  [19] Guan-Ting Lin, Cheng-Han Chiang, and Hung-yi Lee. 2024. Advancing large language models to capture varied speaking styles and respond properly in spoken conversations. arXiv preprint arXiv:2402.12786 (2024).
  [20] Zuwei Long, Yunhang Shen, Chaoyou Fu, Heting Gao, Lijiang Li, Peixian Chen, Mengdan Zhang, Hang Shao, Jian Li, Jinlong Peng, Haoyu Cao, Ke Li, Rongrong Ji, and Xing Sun. 2025. VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model. arXiv preprint arXiv:2505.03739 (2025). https://arxiv.org/abs/2505.03739
  [21] Keming Lu, Bowen Yu, Chang Zhou, and Jingren Zhou. 2024. Large language models are superpositions of all characters: Attaining arbitrary role-play via self-alignment. arXiv preprint arXiv:2401.12474 (2024).
  [22] OpenAI. 2024. GPT-4o System Card. https://cdn.openai.com/gpt-4o-system-card.pdf (2024).
  [23] OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner...
  [24] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems 36 (2023), 53728–53741.
  [25–26] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017).
  [27] Murray Shanahan, Kyle McDonell, and Laria Reynolds. 2023. Role play with large language models. Nature 623, 7987 (2023), 493–498.
  [28] Yunfan Shao, Linyang Li, Junqi Dai, and Xipeng Qiu. 2023. Character-LLM: A Trainable Agent for Role-Playing. arXiv preprint arXiv:2310.10158 (2023). https://arxiv.org/abs/2310.10158
  [29] Tianhao Shen, Sun Li, Quan Tu, and Deyi Xiong. 2023. RoleEval: A bilingual role evaluation benchmark for large language models. arXiv preprint arXiv:2312.16132 (2023).
  [30] Tongyi SpeechTeam. 2024. FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs. arXiv preprint arXiv:2407.04051 (2024).
  [31] Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang. 2023. SALMONN: Towards generic hearing abilities for large language models. arXiv preprint arXiv:2310.13289 (2023).
  [32] Quan Tu, Shilong Fan, Zihang Tian, and Rui Yan. 2024. CharacterEval: A Chinese benchmark for role-playing conversational agent evaluation. arXiv preprint arXiv:2401.01275 (2024).
  [33] Xintao Wang, Yunze Xiao, Jen-tse Huang, Siyu Yuan, Rui Xu, Haoran Guo, Quan Tu, Yaying Fei, Ziang Leng, Wei Wang, et al. 2023. InCharacter: Evaluating personality fidelity in role-playing agents through psychological interviews. arXiv preprint arXiv:2310.17976 (2023).
  [34] Zekun Moore Wang, Zhongyuan Peng, Haoran Que, Jiaheng Liu, Wangchunshu Zhou, Yuhan Wu, Hongcheng Guo, Ruitong Gan, Zehao Ni, Jian Yang, et al. 2023. RoleLLM: Benchmarking, eliciting, and enhancing role-playing abilities of large language models. arXiv preprint arXiv:2310.00746 (2023).
  [35] Weihao Wu, Liang Cao, Xinyu Wu, Zhiwei Lin, Rui Niu, Jingbei Li, and Zhiyong Wu. 2025. VoxRole: A Comprehensive Benchmark for Evaluating Speech-Based Role-Playing Agents. arXiv preprint arXiv:2509.03940 (2025). https://arxiv.org/abs/2509.03940
  [36] Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. 2025. Qwen2.5-Omni technical report. arXiv preprint arXiv:2503.20215 (2025).
  [37] Rui Xu, Dakuan Lu, Xiaoyu Tan, Xintao Wang, Siyu Yuan, Jiangjie Chen, Wei Chu, and Yinghui Xu. 2024. MindEcho: Role-playing language agents for key opinion leaders. arXiv preprint arXiv:2407.05305 (2024).
  [38] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025).
  [39] Chao Yu, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen, and Yi Wu. 2022. The surprising effectiveness of PPO in cooperative multi-agent games. Advances in Neural Information Processing Systems 35 (2022), 24611–24624.
  [40] Aohan Zeng, Zhengxiao Du, Mingdao Liu, Kedong Wang, Shengmin Jiang, Lei Zhao, Yuxiao Dong, and Jie Tang. 2024. GLM-4-Voice: Towards intelligent and human-like end-to-end spoken chatbot. arXiv preprint arXiv:2412.02612 (2024).
  [41] Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. 2023. SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities. arXiv preprint arXiv:2305.11000 (2023).
  [42] Haonan Zhang, Run Luo, Xiong Liu, Yuchuan Wu, Ting-En Lin, Pengpeng Zeng, Qiang Qu, Feiteng Fang, Min Yang, Lianli Gao, et al. 2025. OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction. arXiv preprint arXiv:2505.20277 (2025).
  [43] Pinyi Zhang, Siyu An, Lingfeng Qiao, Yifei Yu, Jingyang Chen, Jie Wang, Di Yin, Xing Sun, and Kai Zhang. 2025. RolePlot: A Systematic Framework for Evaluating and Enhancing the Plot-Progression Capabilities of Role-Playing Agents. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Wanxiang C...
  [44] Pan Zhang, Xiaoyi Dong, Yuhang Cao, Yuhang Zang, Rui Qian, Xilin Wei, Lin Chen, Yifei Li, Junbo Niu, Shuangrui Ding, Qipeng Guo, Haodong Duan, Xin Chen, Han Lv, Zheng Nie, Min Zhang, Bin Wang, Wenwei Zhang, Xinyue Zhang, Jiaye Ge, Wei Li, Jingwen Li, Zhongying Tu, Conghui He, Xingcheng Zhang, Kai Chen, Yu Qiao, Dahua Lin, and Jiaqi Wang. 2024. InternLM-XC...
  [45] Pan Zhang, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Rui Qian, Lin Chen, Qipeng Guo, Haodong Duan, Bin Wang, Linke Ouyang, et al. 2024. InternLM-XComposer-2.5: A versatile large vision language model supporting long-contextual input and output. arXiv preprint arXiv:2407.03320 (2024).
  [46] Jinfeng Zhou, Zhuang Chen, Dazhen Wan, Bosi Wen, Yi Song, Jifan Yu, Yongkang Huang, Pei Ke, Guanqun Bi, Libiao Peng, et al. 2024. CharacterGLM: Customizing Social Characters with Large Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, 1457–1476.