pith. machine review for the scientific record.

arxiv: 2604.11103 · v2 · submitted 2026-04-13 · 💻 cs.SD · cs.AI

Recognition: unknown

ActorMind: Emulating Human Actor Reasoning for Speech Role-Playing

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:58 UTC · model grok-4.3

classification 💻 cs.SD cs.AI
keywords speech role-playing · multi-agent framework · emotional reasoning · chain-of-thought · ActorMindBench · spontaneous dialogue · human-like performance · voice interaction

The pith

ActorMind uses four agents modeled on human actors to generate spontaneous, emotionally fitting speech for role-playing scenarios.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ActorMind as a multi-agent reasoning system that lets language models handle speech role-playing by breaking the task into sequential steps drawn from theater performance. It first absorbs a role description, then parses emotional signals from ongoing dialogue, constructs an internal emotional state, and finally produces spoken output carrying that state. A new hierarchical benchmark called ActorMindBench supplies test material at utterance, scene, and full-role scales. The central goal is to move role-playing beyond static text into natural, context-aware speech that feels personalized to the character and situation.
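
As a concrete illustration of that sequence, the sketch below wires the four agents together in Python. It is a minimal reading of the pipeline as described here, not the paper's implementation: `chat` stands in for any off-the-shelf language model call and `synthesize` for any controllable speech synthesizer, and the prompts and function names are illustrative assumptions.

    # Minimal sketch of the Eye -> Ear -> Brain -> Mouth chain described above.
    # `chat` and `synthesize` are placeholders, not the paper's interfaces.

    def chat(prompt: str) -> str:
        raise NotImplementedError("plug in an off-the-shelf LLM call here")

    def synthesize(text: str, style: str) -> bytes:
        raise NotImplementedError("plug in a controllable TTS backend here")

    def eye_agent(role_description: str) -> str:
        # Absorb the assigned role: distill traits later agents should honor.
        return chat("Summarize the persona, speech habits, and goals of this role:\n"
                    + role_description)

    def ear_agent(dialogue_history: list[str]) -> str:
        # Parse emotional cues from the ongoing spoken dialogue context.
        return chat("Describe the emotional cues and tone in this dialogue:\n"
                    + "\n".join(dialogue_history))

    def brain_agent(persona: str, cues: str, scene: str) -> str:
        # Construct a descriptive internal emotional state for the next reply.
        return chat("Given the persona, scene, and emotional cues, describe the "
                    "character's current emotional state in one short paragraph.\n"
                    f"Persona: {persona}\nScene: {scene}\nCues: {cues}")

    def mouth_agent(persona: str, state: str, dialogue_history: list[str]) -> bytes:
        # Produce the spoken reply, infused with the emotional state.
        reply = chat(f"As this character ({persona}), feeling: {state}, "
                     "reply to the last turns:\n" + "\n".join(dialogue_history[-3:]))
        return synthesize(reply, style=state)

    def actor_mind(role_description: str, scene: str, dialogue_history: list[str]) -> bytes:
        persona = eye_agent(role_description)
        cues = ear_agent(dialogue_history)
        state = brain_agent(persona, cues, scene)
        return mouth_agent(persona, state, dialogue_history)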

Core claim

ActorMind is an off-the-shelf, multi-agent, chain-of-thought-style reasoning framework that emulates how human actors perform in the theater. Concretely, ActorMind first reads its assigned role description via the Eye Agent, then comprehends emotional cues within the contextual spoken dialogue through the Ear Agent. Subsequently, the Brain Agent generates a descriptive emotional state, and finally the Mouth Agent delivers scripts infused with that emotional state. Experimental results demonstrate the effectiveness of ActorMind in enhancing speech role-playing.

What carries the argument

The four-agent chain-of-thought pipeline (Eye for role reading, Ear for dialogue emotion detection, Brain for state description, Mouth for emotional speech output) that turns static role information and spoken context into spontaneous, trait-infused responses.

If this is right

  • Models can now produce speech responses that carry personalized verbal traits tied to a specific role, scene, and spoken dialogue history.
  • Evaluation of speech role-playing becomes possible at three nested scales: individual utterances, full scenes, and entire character arcs.
  • The same decomposition supplies a reusable template for injecting emotional state into any spoken dialogue system.
  • Direct comparison on ActorMindBench shows measurable gains over standard prompting without the agent chain.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The structure could transfer to other creative speech tasks such as audiobook narration or interactive storytelling where emotional consistency matters.
  • Because the agents are off-the-shelf, the method could be plugged into existing voice assistants to make them feel more like distinct characters.
  • If the emotional-state description step proves robust, similar intermediate representations might improve controllability in text-to-speech systems beyond role-play.

Load-bearing premise

That dividing the actor's process into separate Eye, Ear, Brain, and Mouth agents, together with the chosen benchmark levels, actually reflects the main mechanisms that make human speech role-playing natural and effective.

What would settle it

A head-to-head listening test in which raters judge whether responses from the full ActorMind pipeline sound more spontaneous, more consistent with the role, and more emotionally appropriate than responses from the same base models without the four-agent steps; a result showing no reliable difference would undercut the central claim.
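
A minimal sketch of how such a paired test could be scored, assuming each rater compares one ActorMind clip against one baseline clip per item, ties are dropped, and preferences are pooled across items; the protocol and the tallies below are hypothetical, not taken from the paper.

    from math import comb

    def sign_test_p(wins: int, trials: int) -> float:
        # Two-sided exact sign test against a 50/50 null; valid because the
        # null binomial is symmetric. `wins` = items where the ActorMind clip
        # was preferred, `trials` = items with a clear preference (ties dropped).
        center = trials / 2
        dev = abs(wins - center)
        mass = sum(comb(trials, k) for k in range(trials + 1) if abs(k - center) >= dev)
        return mass / 2 ** trials

    # Hypothetical tallies: 100 comparison items, ActorMind preferred on 63.
    wins, trials = 63, 100
    print(f"preference rate = {wins / trials:.2f}, p = {sign_test_p(wins, trials):.4f}")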

Figures

Figures reproduced from arXiv: 2604.11103 by Wei Xue, Xi Chen, Yike Guo.

Figure 1. Overview of ActorMindBench, which comprises three content levels (utterance, scene, and role).
Figure 2. Overview of ActorMind, which operates in a multi-agent chain-of-thought reasoning style.
Figure 3. Spectrogram comparison of baselines and ActorMind. All samples are generated for Phoebe performing "...So, um, do you think he's doing any better than he was this morning?" under the same scene and context.
Figure 4. ActorMindBench example data: per-episode utterance counts and durations for Rachel, Monica, Phoebe, Joey, Chandler, Ross, and other speakers.
Figure 5. Prompt for scene captioning.
Figure 6. Prompt for role profile generation.
Figure 8. Prompt for the Brain Agent, used for role injection, contextual understanding, and emotion rendering; yellow content marks what the Eye Agent saw, blue what the Ear Agent heard, and purple what the Brain Agent inferred.
read the original abstract

Role-playing has garnered rising attention as it provides a strong foundation for human-machine interaction and facilitates sociological research. However, current work is confined to textual modalities, neglecting speech, which plays a predominant role in daily life, thus limiting genuine role-playing. To bridge this gap, we conceptualize and benchmark speech role-playing through ActorMindBench, and we present a corresponding reasoning framework, called ActorMind. Specifically, (1) Speech Role-Playing enables models to deliver spontaneous responses with personalized verbal traits based on their role, the scene, and spoken dialogue. (2) ActorMindBench is a hierarchical benchmark comprises Utterance-Level content with 7,653 utterances, Scene-Level content with 313 scenes, and Role-Level content with 6 roles. (3) ActorMind is an off-the-shelf, multi-agent, chain-of-though style reasoning framework that emulates how human actors perform in theaters. Concretely, ActorMind first reads its assigned role description via Eye Agent, then comprehends emotional cues within contextual spoken dialogues through Ear Agent. Subsequently, Brain Agent generates a descriptive emotional state, and finally, Mouth Agent delivers the scripts infused with corresponding emotion state. Experimental results demonstrate the effectiveness of ActorMind in enhancing speech role-playing.
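
For orientation, a minimal sketch of the hierarchy the abstract describes, with utterances nested in scenes and scenes nested in roles; only the three level names and their counts (7,653 utterances, 313 scenes, 6 roles) come from the abstract, while the field names are assumptions.

    from dataclasses import dataclass, field

    @dataclass
    class Utterance:              # Utterance-Level content (7,653 in the benchmark)
        speaker: str
        text: str
        audio_path: str           # assumed field: reference speech clip
        emotion_label: str = ""   # assumed field: optional emotion annotation

    @dataclass
    class Scene:                  # Scene-Level content (313 in the benchmark)
        scene_id: str
        description: str
        utterances: list[Utterance] = field(default_factory=list)

    @dataclass
    class Role:                   # Role-Level content (6 in the benchmark)
        name: str
        profile: str              # role description the Eye Agent would read
        scenes: list[Scene] = field(default_factory=list)

    def iter_utterances(roles: list[Role]):
        # Walk the hierarchy so evaluation can run at any of the three scales.
        for role in roles:
            for scene in role.scenes:
                for utt in scene.utterances:
                    yield role, scene, utt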

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces ActorMindBench, a hierarchical benchmark for speech role-playing (7,653 utterances across 313 scenes and 6 roles), and ActorMind, an off-the-shelf multi-agent chain-of-thought framework that decomposes reasoning into Eye Agent (role description), Ear Agent (emotional cues from dialogue), Brain Agent (emotional state), and Mouth Agent (script delivery). It claims that this emulates human actor reasoning and that experimental results demonstrate its effectiveness for spontaneous, personalized speech responses.

Significance. If the experimental claims hold with proper validation, the work could meaningfully advance speech-based role-playing by shifting from text-only approaches to speech modalities and by supplying both a structured reasoning framework and a dedicated benchmark. This has potential value for human-machine interaction and sociological studies, provided the agent decomposition and metrics capture genuine improvements in spontaneity and naturalness rather than prompting artifacts.

major comments (2)
  1. [Abstract] The central claim that 'Experimental results demonstrate the effectiveness of ActorMind in enhancing speech role-playing' is unsupported by any quantitative results, baselines (e.g., direct LLM prompting), error bars, objective metrics for spontaneity/naturalness, ablations of the four-agent split, or details on benchmark construction and human validation. This absence leaves a load-bearing claim for the paper's contribution unverifiable.
  2. [Abstract] The manuscript provides no evidence or ablations testing whether the Eye-Ear-Brain-Mouth decomposition captures key human actor mechanisms (emotional cue comprehension, personalized verbal traits) versus simply being a more elaborate prompting strategy. Without such tests or comparisons, the framework's claimed advantage over simpler methods remains ungrounded.
minor comments (2)
  1. [Abstract] Abstract contains a typo: 'chain-of-though' should be 'chain-of-thought'.
  2. [Abstract] Abstract grammar: 'a hierarchical benchmark comprises' should be rephrased for correctness (e.g., 'a hierarchical benchmark that comprises' or 'a hierarchical benchmark comprising').

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the thorough review and valuable suggestions. We respond to the major comments below and have made revisions to strengthen the presentation of our results and framework.

read point-by-point responses
  1. Referee: The central claim that 'Experimental results demonstrate the effectiveness of ActorMind in enhancing speech role-playing' is unsupported by any quantitative results, baselines (e.g., direct LLM prompting), error bars, objective metrics for spontaneity/naturalness, ablations of the four-agent split, or details on benchmark construction and human validation. This absence makes the effectiveness claim unverifiable and load-bearing for the paper's contribution.

    Authors: We acknowledge that the abstract does not include specific quantitative details or references to sections. The full manuscript describes the benchmark construction in Section 3 and presents experimental results in Section 4, including comparisons to baselines like direct prompting. However, to address the concern, we will revise the abstract to summarize key findings with metrics and error bars, and ensure all details on human validation are clearly stated. This will make the effectiveness claim properly supported. revision: yes

  2. Referee: The manuscript provides no evidence or ablations testing whether the Eye-Ear-Brain-Mouth decomposition captures key human actor mechanisms (emotional cue comprehension, personalized verbal traits) versus simply being a more elaborate prompting strategy. Without such tests or comparisons, the framework's claimed advantage over simpler methods remains ungrounded.

    Authors: The decomposition is motivated by how human actors prepare for roles, as described in the introduction and related work. The experiments include comparisons to simpler prompting strategies, showing improvements. To directly test if it captures the mechanisms, we will add ablations of the agent split and a discussion of how each agent addresses specific aspects like emotional cues. We agree this strengthens the claim and will include it in the revision. revision: yes
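
For concreteness, a minimal sketch of what the promised ablation could look like: score the benchmark once under direct prompting, once with the full chain, and once per variant that skips one agent. The `generate_*` and `score` callables are placeholders standing in for evaluation code that is not shown here.

    from statistics import mean

    AGENTS = ("eye", "ear", "brain", "mouth")

    def run_ablation(items, generate_direct, generate_full, generate_without, score):
        # `score(output, item)` is assumed to return a higher-is-better number
        # (e.g., a rater or judge score); return one mean score per condition.
        results = {
            "direct_prompting": mean(score(generate_direct(x), x) for x in items),
            "full_chain": mean(score(generate_full(x), x) for x in items),
        }
        # Drop one agent at a time to see which step carries the gain.
        for agent in AGENTS:
            results[f"without_{agent}"] = mean(
                score(generate_without(x, skip=agent), x) for x in items
            )
        return results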

Circularity Check

0 steps flagged

No circularity: framework and benchmark are independently specified without self-referential reduction

full rationale

The paper defines ActorMind as a four-agent chain-of-thought prompting framework (Eye reads role description, Ear processes emotional cues, Brain generates emotional state, Mouth produces output) and ActorMindBench as a hierarchical dataset (7,653 utterances, 313 scenes, 6 roles). No equations, derivations, fitted parameters, or first-principles results exist. Effectiveness is asserted via experimental results on the custom benchmark rather than any construction that reduces the output to the input by definition. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked. The agent decomposition and benchmark are presented as author-designed constructs for evaluation, not as tautological or statistically forced predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 4 invented entities

The central claim rests on the untested assumption that human actor reasoning can be decomposed into the four sequential agents described and that this decomposition yields better speech output than direct prompting. The abstract states no free parameters, and none of the invented entities is backed by independent evidence.

axioms (1)
  • domain assumption Human actors reason about role, scene, and dialogue in a sequential perception-to-action pipeline that can be emulated by separate specialized agents.
    Invoked in the description of Eye, Ear, Brain, and Mouth agents.
invented entities (4)
  • Eye Agent no independent evidence
    purpose: Reads assigned role description
    Component of the proposed framework; no independent evidence provided.
  • Ear Agent no independent evidence
    purpose: Comprehends emotional cues in spoken dialogue
    Component of the proposed framework; no independent evidence provided.
  • Brain Agent no independent evidence
    purpose: Generates descriptive emotional state
    Component of the proposed framework; no independent evidence provided.
  • Mouth Agent no independent evidence
    purpose: Delivers scripts infused with emotion
    Component of the proposed framework; no independent evidence provided.

pith-pipeline@v0.9.0 · 5519 in / 1417 out tokens · 41313 ms · 2026-05-10T15:58:32.793909+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

57 extracted references · 25 canonical work pages · 10 internal anchors


  3. [3]

    On the performing arts: The anatomy of their economic problems

    William J. Baumol and William G. Bowen. 1965. On the performing arts: The anatomy of their economic problems. The American economic review, 55(1/2):495--502

  4. [4]

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774

  5. [5]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877--1901

  6. [6]

    Edresson Casanova, Julian Weber, Christopher D Shulby, Arnaldo Candido Junior, Eren Gölge, and Moacir A Ponti. 2022. Yourtts: Towards zero-shot multi-speaker tts and zero-shot voice conversion for everyone. In International conference on machine learning, pages 2709--2720. PMLR

  7. [7]

    Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. 2024. Chateval: Towards better llm-based evaluators through multi-agent debate. In The Twelfth International Conference on Learning Representations

  8. [8]

    Yupeng Chang, Yi Chang, and Yuan Wu. 2026. BA-LoRA: Bias-alleviating low-rank adaptation to mitigate catastrophic inheritance in large language models. In The Fourteenth International Conference on Learning Representations. https://openreview.net/forum?id=q0X9SiXiRO

  9. [9]

    Yupeng Chang, Chenlu Guo, Yi Chang, and Yuan Wu. 2025. Lora-mgpo: Mitigating double descent in low-rank adaptation via momentum-guided perturbation optimization. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 648--659

  10. [10]

    Chaoran Chen, Bingsheng Yao, Ruishi Zou, Wenyue Hua, Weimin Lyu, Toby Jia-Jun Li, and Dakuo Wang. 2025. Towards a design guideline for rpa evaluation: A survey of large language model-based role-playing agents. CoRR

  11. [11]

    Nuo Chen, Yan Wang, Haiyun Jiang, Deng Cai, Yuhan Li, Ziyang Chen, Longyue Wang, and Jia Li. 2022 a . Large language models meet harry potter: A bilingual dataset for aligning dialogue agents with characters. arXiv preprint arXiv:2211.06869

  12. [12]

    Sanyuan Chen, Shujie Liu, Long Zhou, Yanqing Liu, Xu Tan, Jinyu Li, Sheng Zhao, Yao Qian, and Furu Wei. 2024 a . Vall-e 2: Neural codec language models are human parity zero-shot text to speech synthesizers. arXiv preprint arXiv:2406.05370

  13. [13]

    Xi Chen. 2024. Mmrbn: Rule-based network for multimodal emotion recognition. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8200--8204

  14. [14]

    Xi Chen, Yongwei Gao, and Wei Li. 2022 b . Singing voice detection via similarity-based semi-supervised learning. In Proceedings of the 4th ACM International Conference on Multimedia in Asia, MMAsia '22, New York, NY, USA. Association for Computing Machinery

  15. [15]

    Xi Chen and Min Zeng. 2025. Prototype conditioned generative replay for continual learning in NLP . In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 12754--12770, Albuquerque, New Mexico. Association for Computational Li...

  16. [16]

    Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, Jian Zhao, Kai Yu, and Xie Chen. 2024 b . F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching. arXiv preprint arXiv:2410.06885

  17. [17]

    Min Chu and Hu Peng. 2006. Objective measure for estimating mean opinion score of synthesized speech. US Patent 7,024,362

  18. [18]

    Yanqi Dai, Huanran Hu, Lei Wang, Shengjie Jin, Xu Chen, and Zhiwu Lu. 2024. Mmrole: A comprehensive framework for developing and evaluating multimodal role-playing agents. arXiv preprint arXiv:2408.04203

  19. [19]

    Wei Deng, Siyi Zhou, Jingchen Shu, Jinchao Wang, and Lu Wang. 2025. Indextts: An industrial-level controllable and efficient zero-shot text-to-speech system. arXiv preprint arXiv:2502.05512

  20. [20]

    Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, et al. 2024. Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens. arXiv preprint arXiv:2407.05407

  21. [21]

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. arXiv e-prints, pages arXiv--2407

  22. [22]

    Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, and Qing Li. 2024. A survey on rag meeting llms: Towards retrieval-augmented large language models. In Proceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining, pages 6491--6501

  23. [23]

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. 2024. Gpt-4o system card. arXiv preprint arXiv:2410.21276

  24. [24]

    Dongming Jiang, Yi Li, Guanpeng Li, and Bingzhe Li. 2026. Magma: A multi-graph based agentic memory architecture for ai agents. arXiv preprint arXiv:2601.03236

  25. [25]

    Yuxuan Jiang and Francis Ferraro. 2026. Scribe: Structured mid-level supervision for tool-using language models. Preprint, arXiv:2601.03555. https://arxiv.org/abs/2601.03555

  26. [26]

    Yuxuan Jiang, Dawei Li, and Frank Ferraro. 2025. Drp: Distilled reasoning pruning with skill-aware step decomposition for efficient large reasoning models. arXiv preprint arXiv:2505.13975

  27. [27]

    Yizhu Jin, Zhen Ye, Zeyue Tian, Haohe Liu, Qiuqiang Kong, Yike Guo, and Wei Xue. 2026. Inference-time scaling for diffusion-based audio super-resolution. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 14982--14990

  28. [28]

    Linus Johansson. 2025. Open weight large language models as a design material in rpgs

  29. [29]

    Yanshu Li, Yi Cao, Hongyang He, Qisen Cheng, Xiang Fu, Xi Xiao, Tianyang Wang, and Ruixiang Tang. 2025. M²IV: Towards efficient and fine-grained multimodal in-context learning via representation engineering. arXiv preprint arXiv:2504.04633

  30. [30]

    Guoming Ling, Zhongzhan Huang, Yupei Lin, Junxin Li, Shanshan Zhong, Hefeng Wu, and Liang Lin. 2026. Neural chain-of-thought search: Searching the optimal reasoning path to enhance large language models. arXiv preprint arXiv:2601.11340

  31. [31]

    Xiaoxu Ma, Xiangbo Zhang, and Zhenyu Weng. 2026. Stable and explainable personality trait evaluation in large language models with internal activations. arXiv preprint arXiv:2601.09833

  32. [32]

    Mahmoud Mohammadi, Yipeng Li, Jane Lo, and Wendy Yip. 2025 a . Evaluation and benchmarking of llm agents: A survey. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2, pages 6129--6139

  33. [33]

    Mahmoud Mohammadi, Yipeng Li, Jane Lo, and Wendy Yip. 2025 b . Evaluation and benchmarking of llm agents: A survey. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2, pages 6129--6139

  34. [34]

    Zekun Moore Wang, Zhongyuan Peng, Haoran Que, Jiaheng Liu, Wangchunshu Zhou, Yuhan Wu, Hongcheng Guo, Ruitong Gan, Zehao Ni, Jian Yang, et al. 2024. Rolellm: benchmarking, eliciting, and enhancing role-playing abilities of large language models. Findings of the Association for Computational Linguistics: ACL 2024, pages 14743--14777

  35. [35]

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori B Hashimoto. 2025. s1: Simple test-time scaling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 20286--20332

  36. [36]

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak supervision. In International conference on machine learning, pages 28492--28518. PMLR

  37. [37]

    Yunfan Shao, Linyang Li, Junqi Dai, and Xipeng Qiu. 2023. Character-llm: A trainable agent for role-playing. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13153--13187

  38. [38]

    Konstantin Stanislavski and Jean Benedetti. 2009. An actor's work on a role. Routledge

  39. [39]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30

  40. [40]

    Xinsheng Wang, Mingqi Jiang, Ziyang Ma, Ziyu Zhang, Songxiang Liu, Linqin Li, Zheng Liang, Qixi Zheng, Rui Wang, Xiaoqin Feng, et al. 2025. Spark-tts: An efficient llm-based text-to-speech model with single-stream decoupled speech tokens. arXiv preprint arXiv:2503.01710

  41. [41]

    Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2021. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652

  42. [42]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824--24837

  43. [43]

    Zheng Weihua, Xin Huang, Zhengyuan Liu, Tarun Kumar Vangani, Bowei Zou, Xiyan Tao, Yuhao Wu, AiTi Aw, Nancy F. Chen, and Roy Ka-Wei Lee. 2026. Adamcot: Rethinking cross-lingual factual reasoning through adaptive multilingual chain-of-thought. Proceedings of the AAAI Conference on Artificial Intelligence, 40(40):33863--33871

  44. [44]

    Zheng Weihua, Roy Ka-Wei Lee, Zhengyuan Liu, Wu Kui, AiTi Aw, and Bowei Zou. 2025. CCL-XCoT: An efficient cross-lingual knowledge transfer method for mitigating hallucination generation. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 1768--1788, Suzhou, China. Association for Computational Linguistics

  45. [45]

    Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. 2025. Qwen2.5-Omni technical report. arXiv preprint arXiv:2503.20215

  46. [46]

    Yaoxun Xu, Hangting Chen, Jianwei Yu, Qiaochu Huang, Zhiyong Wu, Shi-Xiong Zhang, Guangzhi Li, Yi Luo, and Rongzhi Gu. 2024. Secap: Speech emotion captioning with large language model. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19323--19331

  47. [47]

    Shuo Yang, Soyeon Caren Han, Yihao Ding, Shuhe Wang, and Eduard Hoy. 2026 a . Tooltree: Efficient llm agent tool planning via dual-feedback monte carlo tree search and bidirectional pruning. arXiv preprint arXiv:2603.12740

  48. [48]

    Shuo Yang, Soyeon Caren Han, Xueqi Ma, Yan Li, Mohammad Reza Ghasemi Madani, and Eduard Hovy. 2026 b . Evotool: Self-evolving tool-use policy optimization in llm agents via blame-aware mutation and diversity-aware selection. arXiv preprint arXiv:2603.04900

  49. [49]

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations

  50. [50]

    Haohan Yuan and Haopeng Zhang. 2025. Understanding llm reasoning for abstractive summarization. arXiv preprint arXiv:2512.03503

  51. [51]

    Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, Xing Wei, and Ning Guo. 2025. Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving. arXiv preprint arXiv:2505.17685

  52. [52]

    Wenyuan Zhang, Tianyun Liu, Mengxiao Song, Xiaodong Li, and Tingwen Liu. 2025 a . SOTOPIA - : Dynamic strategy injection learning and social instruction following evaluation for social agents. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 24669--24697, Vienna, Austria. Association...

  53. [53]

    Wenyuan Zhang, Shuaiyi Nie, Jiawei Sheng, Zefeng Zhang, Xinghua Zhang, Yongquan He, and Tingwen Liu. 2025 b . Revealing and mitigating the challenge of detecting character knowledge errors in llm role-playing. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 33267--33290

  54. [54]

    Xinglang Zhang, Yunyao Zhang, ZeLiang Chen, Junqing Yu, Wei Yang, and Zikai Song. 2026 a . Logical phase transitions: Understanding collapse in llm logical reasoning. arXiv preprint arXiv:2601.02902

  55. [55]

    Yunyao Zhang, Zikai Song, Hang Zhou, Wenfeng Ren, Yi-Ping Phoebe Chen, Junqing Yu, and Wei Yang. 2025c. GA-S^3: Comprehensive social network simulation with group agents. In Findings of the Association for Computational Linguistics: ACL 2025, pages 8950--8970, Vienna, Austria. Association for Computational Linguistics. https://doi.org/10.18653/v1/2025.findings-acl.468

  56. [56]

    BiMind: A Dual-Head Reasoning Model with Attention-Geometry Adapter for Incorrect Information Detection

    Zhongxing Zhang, Emily K. Vraga, Jisu Huh, and Jaideep Srivastava. 2026b. BiMind: A dual-head reasoning model with attention-geometry adapter for incorrect information detection. Preprint, arXiv:2604.06022. https://arxiv.org/abs/2604.06022

  57. [57]

    Wei Zhu, Zhiwen Tang, and Kun Yue. 2026. Symphony: Synergistic multi-agent planning with heterogeneous language model assembly. arXiv preprint arXiv:2601.22623