ActorMind: Emulating Human Actor Reasoning for Speech Role-Playing
Pith reviewed 2026-05-10 15:58 UTC · model grok-4.3
The pith
ActorMind uses four agents modeled on human actors to generate spontaneous, emotionally fitting speech for role-playing scenarios.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ActorMind is an off-the-shelf, multi-agent, chain-of-thought style reasoning framework that emulates how human actors perform in theaters. Concretely, ActorMind first reads its assigned role description via Eye Agent, then comprehends emotional cues within contextual spoken dialogues through Ear Agent. Subsequently, Brain Agent generates a descriptive emotional state, and finally, Mouth Agent delivers the scripts infused with corresponding emotion state. Experimental results demonstrate the effectiveness of ActorMind in enhancing speech role-playing.
What carries the argument
The four-agent chain-of-thought pipeline (Eye for role reading, Ear for dialogue emotion detection, Brain for state description, Mouth for emotional speech output) that turns static role information and spoken context into spontaneous, trait-infused responses.
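The chain can be pictured as a single sequential pass of prompts. The sketch below is an illustrative reconstruction: the `llm` callable and all prompt wording are hypothetical assumptions, since the paper specifies only each agent's role, not its prompts.

```python
# Illustrative sketch of the Eye -> Ear -> Brain -> Mouth chain described
# above. The `llm` callable and all prompt wording are hypothetical; the
# paper specifies only what each agent does, not how it is prompted.

def actor_mind(llm, role_description: str, dialogue_history: list[str]) -> str:
    # Eye Agent: read the assigned role description.
    role_profile = llm(f"Summarize this role's traits:\n{role_description}")
    # Ear Agent: comprehend emotional cues in the spoken-dialogue context.
    cues = llm("List the emotional cues in this dialogue:\n" + "\n".join(dialogue_history))
    # Brain Agent: fuse role and cues into a descriptive emotional state.
    state = llm(f"Role: {role_profile}\nCues: {cues}\nDescribe the current emotional state.")
    # Mouth Agent: deliver the response infused with that emotional state.
    return llm(f"Respond in character, expressing this emotional state:\n{state}")
```

Each stage's output becomes context for the next, which is what makes the pipeline chain-of-thought style rather than one monolithic prompt.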
If this is right
- Models can now produce speech responses that carry personalized verbal traits tied to a specific role, scene, and spoken dialogue history.
- Evaluation of speech role-playing becomes possible at three nested scales: individual utterances, full scenes, and entire character arcs.
- The same decomposition supplies a reusable template for injecting emotional state into any spoken dialogue system.
- Direct comparison on ActorMindBench shows measurable gains over standard prompting without the agent chain.
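The three nested scales can be pictured as a simple grouping of flat utterance records. The field names below are assumptions for illustration, not ActorMindBench's actual schema.

```python
from collections import defaultdict

def nest(records):
    """Group flat utterance records into role -> scene -> utterances."""
    # Hypothetical schema: each record has "role", "scene", and "utterance" keys.
    roles = defaultdict(lambda: defaultdict(list))
    for r in records:
        roles[r["role"]][r["scene"]].append(r["utterance"])
    return roles

records = [
    {"role": "role_a", "scene": "scene_1", "utterance": "Who's there?"},
    {"role": "role_a", "scene": "scene_1", "utterance": "Speak, I charge thee."},
    {"role": "role_b", "scene": "scene_2", "utterance": "My lord?"},
]
grouped = nest(records)
assert len(grouped) == 2                       # role level
assert len(grouped["role_a"]) == 1             # scene level for one role
assert len(grouped["role_a"]["scene_1"]) == 2  # utterance level
```

Metrics can then be aggregated at whichever scale is under evaluation: per utterance, per scene, or per role arc.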
Where Pith is reading between the lines
- The structure could transfer to other creative speech tasks such as audiobook narration or interactive storytelling where emotional consistency matters.
- Because the agents are off-the-shelf, the method could be plugged into existing voice assistants to make them feel more like distinct characters.
- If the emotional-state description step proves robust, similar intermediate representations might improve controllability in text-to-speech systems beyond role-play.
Load-bearing premise
That the division of the actor's process into separate Eye, Ear, Brain, and Mouth agents, together with the chosen benchmark levels, actually reflects the main mechanisms that make human speech role-playing natural and effective.
What would settle it
A head-to-head listening test in which raters judge whether responses from the full ActorMind pipeline sound more spontaneous, role-consistent, and emotionally appropriate than those from identical base models without the four-agent steps; if raters show no reliable preference, the core claim fails.
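One way such a pairwise listening test could be scored (an illustrative analysis, not one the paper reports): each rater picks the preferred response in a matched pair, and an exact two-sided sign test asks whether the preference rate differs reliably from chance.

```python
from math import comb

def sign_test_p(prefer_pipeline: int, prefer_baseline: int) -> float:
    """Two-sided exact binomial (sign) test against p = 0.5; ties are excluded."""
    n = prefer_pipeline + prefer_baseline
    k = min(prefer_pipeline, prefer_baseline)
    # Probability of a split at least this lopsided under the null (chance).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# 70 of 100 non-tied raters preferring the full pipeline is a reliable
# difference; a 52/48 split is indistinguishable from chance.
p_strong = sign_test_p(70, 30)   # well below 0.05
p_null = sign_test_p(52, 48)     # well above 0.05
```

In practice one would also stratify by role and scene, since aggregate preference can mask per-role failures.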
Original abstract
Role-playing has garnered rising attention as it provides a strong foundation for human-machine interaction and facilitates sociological research. However, current work is confined to textual modalities, neglecting speech, which plays a predominant role in daily life, thus limiting genuine role-playing. To bridge this gap, we conceptualize and benchmark speech role-playing through ActorMindBench, and we present a corresponding reasoning framework, called ActorMind. Specifically, (1) Speech Role-Playing enables models to deliver spontaneous responses with personalized verbal traits based on their role, the scene, and spoken dialogue. (2) ActorMindBench is a hierarchical benchmark comprises Utterance-Level content with 7,653 utterances, Scene-Level content with 313 scenes, and Role-Level content with 6 roles. (3) ActorMind is an off-the-shelf, multi-agent, chain-of-though style reasoning framework that emulates how human actors perform in theaters. Concretely, ActorMind first reads its assigned role description via Eye Agent, then comprehends emotional cues within contextual spoken dialogues through Ear Agent. Subsequently, Brain Agent generates a descriptive emotional state, and finally, Mouth Agent delivers the scripts infused with corresponding emotion state. Experimental results demonstrate the effectiveness of ActorMind in enhancing speech role-playing.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ActorMindBench, a hierarchical benchmark for speech role-playing (7,653 utterances across 313 scenes and 6 roles), and ActorMind, an off-the-shelf multi-agent chain-of-thought framework that decomposes reasoning into Eye Agent (role description), Ear Agent (emotional cues from dialogue), Brain Agent (emotional state), and Mouth Agent (script delivery). It claims that this emulates human actor reasoning and that experimental results demonstrate its effectiveness for spontaneous, personalized speech responses.
Significance. If the experimental claims hold with proper validation, the work could meaningfully advance speech-based role-playing by shifting from text-only approaches to speech modalities and by supplying both a structured reasoning framework and a dedicated benchmark. This has potential value for human-machine interaction and sociological studies, provided the agent decomposition and metrics capture genuine improvements in spontaneity and naturalness rather than prompting artifacts.
major comments (2)
- [Abstract] The central claim that 'Experimental results demonstrate the effectiveness of ActorMind in enhancing speech role-playing' is unsupported by quantitative results, baselines (e.g., direct LLM prompting), error bars, objective metrics for spontaneity/naturalness, ablations of the four-agent split, or details on benchmark construction and human validation. This absence leaves the effectiveness claim, which is load-bearing for the paper's contribution, unverifiable.
- [Abstract] The manuscript provides no evidence or ablations testing whether the Eye-Ear-Brain-Mouth decomposition captures key human actor mechanisms (emotional cue comprehension, personalized verbal traits) versus simply being a more elaborate prompting strategy. Without such tests or comparisons, the framework's claimed advantage over simpler methods remains ungrounded.
minor comments (2)
- [Abstract] Abstract contains a typo: 'chain-of-though' should be 'chain-of-thought'.
- [Abstract] Abstract grammar: 'a hierarchical benchmark comprises' should be rephrased for correctness (e.g., 'a hierarchical benchmark comprising' or 'that comprises').
Simulated Author's Rebuttal
We are grateful to the referee for the thorough review and valuable suggestions. We respond to the major comments below and have made revisions to strengthen the presentation of our results and framework.
Point-by-point responses
-
Referee: The central claim that 'Experimental results demonstrate the effectiveness of ActorMind in enhancing speech role-playing' is unsupported by quantitative results, baselines (e.g., direct LLM prompting), error bars, objective metrics for spontaneity/naturalness, ablations of the four-agent split, or details on benchmark construction and human validation. This absence leaves the effectiveness claim, which is load-bearing for the paper's contribution, unverifiable.
Authors: We acknowledge that the abstract does not include specific quantitative details or references to sections. The full manuscript describes the benchmark construction in Section 3 and presents experimental results in Section 4, including comparisons to baselines like direct prompting. However, to address the concern, we will revise the abstract to summarize key findings with metrics and error bars, and ensure all details on human validation are clearly stated. This will make the effectiveness claim properly supported. Revision: yes.
-
Referee: The manuscript provides no evidence or ablations testing whether the Eye-Ear-Brain-Mouth decomposition captures key human actor mechanisms (emotional cue comprehension, personalized verbal traits) versus simply being a more elaborate prompting strategy. Without such tests or comparisons, the framework's claimed advantage over simpler methods remains ungrounded.
Authors: The decomposition is motivated by how human actors prepare for roles, as described in the introduction and related work. The experiments include comparisons to simpler prompting strategies, showing improvements. To directly test whether it captures the claimed mechanisms, we will add ablations of the agent split and a discussion of how each agent addresses specific aspects such as emotional cues. We agree this strengthens the claim and will include it in the revision. Revision: yes.
Circularity Check
No circularity: framework and benchmark are independently specified without self-referential reduction
Full rationale
The paper defines ActorMind as a four-agent chain-of-thought prompting framework (Eye reads role description, Ear processes emotional cues, Brain generates emotional state, Mouth produces output) and ActorMindBench as a hierarchical dataset (7,653 utterances, 313 scenes, 6 roles). No equations, derivations, fitted parameters, or first-principles results exist. Effectiveness is asserted via experimental results on the custom benchmark rather than any construction that reduces the output to the input by definition. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked. The agent decomposition and benchmark are presented as author-designed constructs for evaluation, not as tautological or statistically forced predictions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Human actors reason about role, scene, and dialogue in a sequential perception-to-action pipeline that can be emulated by separate specialized agents.
invented entities (4)
-
Eye Agent
no independent evidence
-
Ear Agent
no independent evidence
-
Brain Agent
no independent evidence
-
Mouth Agent
no independent evidence
Reference graph
Works this paper leans on
-
[3]
William J. Baumol and William G. Bowen. 1965. On the performing arts: The anatomy of their economic problems. The American Economic Review, 55(1/2):495--502
1965
-
[4]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774
2023
-
[5]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877--1901
2020
-
[6]
Edresson Casanova, Julian Weber, Christopher D Shulby, Arnaldo Candido Junior, Eren Gölge, and Moacir A Ponti. 2022. YourTTS: Towards zero-shot multi-speaker TTS and zero-shot voice conversion for everyone. In International conference on machine learning, pages 2709--2720. PMLR
2022
-
[7]
Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. 2024. Chateval: Towards better llm-based evaluators through multi-agent debate. In The Twelfth International Conference on Learning Representations
2024
-
[8]
Yupeng Chang, Yi Chang, and Yuan Wu. 2026. BA-LoRA: Bias-alleviating low-rank adaptation to mitigate catastrophic inheritance in large language models. In The Fourteenth International Conference on Learning Representations. https://openreview.net/forum?id=q0X9SiXiRO
2026
-
[9]
Yupeng Chang, Chenlu Guo, Yi Chang, and Yuan Wu. 2025. LoRA-MGPO: Mitigating double descent in low-rank adaptation via momentum-guided perturbation optimization. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 648--659
2025
-
[10]
Chaoran Chen, Bingsheng Yao, Ruishi Zou, Wenyue Hua, Weimin Lyu, Toby Jia-Jun Li, and Dakuo Wang. 2025. Towards a design guideline for rpa evaluation: A survey of large language model-based role-playing agents. CoRR
2025
-
[13]
Xi Chen. 2024. MMRBN: Rule-based network for multimodal emotion recognition. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8200--8204
2024
-
[14]
Xi Chen, Yongwei Gao, and Wei Li. 2022b. Singing voice detection via similarity-based semi-supervised learning. In Proceedings of the 4th ACM International Conference on Multimedia in Asia, MMAsia '22, New York, NY, USA. Association for Computing Machinery
2022
-
[15]
Xi Chen and Min Zeng. 2025. Prototype conditioned generative replay for continual learning in NLP. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 12754--12770, Albuquerque, New Mexico. Association for Computational Linguistics
2025
-
[17]
Min Chu and Hu Peng. 2006. Objective measure for estimating mean opinion score of synthesized speech. US Patent 7,024,362
2006
-
[21]
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783
2024
-
[22]
Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, and Qing Li. 2024. A survey on rag meeting llms: Towards retrieval-augmented large language models. In Proceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining, pages 6491--6501
2024
-
[23]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. 2024. Gpt-4o system card. arXiv preprint arXiv:2410.21276
2024
-
[24]
Dongming Jiang, Yi Li, Guanpeng Li, and Bingzhe Li. 2026. Magma: A multi-graph based agentic memory architecture for ai agents. arXiv preprint arXiv:2601.03236
2026
-
[25]
Yuxuan Jiang and Francis Ferraro. 2026. Scribe: Structured mid-level supervision for tool-using language models. Preprint, arXiv:2601.03555
2026
-
[26]
Yuxuan Jiang, Dawei Li, and Frank Ferraro. 2025. Drp: Distilled reasoning pruning with skill-aware step decomposition for efficient large reasoning models. arXiv preprint arXiv:2505.13975
2025
-
[27]
Yizhu Jin, Zhen Ye, Zeyue Tian, Haohe Liu, Qiuqiang Kong, Yike Guo, and Wei Xue. 2026. Inference-time scaling for diffusion-based audio super-resolution. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 14982--14990
2026
-
[28]
Linus Johansson. 2025. Open weight large language models as a design material in RPGs
2025
-
[30]
Guoming Ling, Zhongzhan Huang, Yupei Lin, Junxin Li, Shanshan Zhong, Hefeng Wu, and Liang Lin. 2026. Neural chain-of-thought search: Searching the optimal reasoning path to enhance large language models. arXiv preprint arXiv:2601.11340
2026
-
[32]
Mahmoud Mohammadi, Yipeng Li, Jane Lo, and Wendy Yip. 2025a. Evaluation and benchmarking of llm agents: A survey. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2, pages 6129--6139
2025
-
[34]
Zekun Moore Wang, Zhongyuan Peng, Haoran Que, Jiaheng Liu, Wangchunshu Zhou, Yuhan Wu, Hongcheng Guo, Ruitong Gan, Zehao Ni, Jian Yang, et al. 2024. RoleLLM: Benchmarking, eliciting, and enhancing role-playing abilities of large language models. Findings of the Association for Computational Linguistics: ACL 2024, pages 14743--14777
2024
-
[35]
Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori B Hashimoto. 2025. s1: Simple test-time scaling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 20286--20332
2025
-
[36]
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak supervision. In International conference on machine learning, pages 28492--28518. PMLR
2023
-
[37]
Yunfan Shao, Linyang Li, Junqi Dai, and Xipeng Qiu. 2023. Character-llm: A trainable agent for role-playing. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13153--13187
2023
-
[38]
Konstantin Stanislavski and Jean Benedetti. 2009. An actor's work on a role. Routledge
2009
-
[39]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30
2017
-
[41]
Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2021. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652
2021
-
[42]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824--24837
2022
-
[43]
Zheng Weihua, Xin Huang, Zhengyuan Liu, Tarun Kumar Vangani, Bowei Zou, Xiyan Tao, Yuhao Wu, AiTi Aw, Nancy F. Chen, and Roy Ka-Wei Lee. 2026. AdaMCoT: Rethinking cross-lingual factual reasoning through adaptive multilingual chain-of-thought. Proceedings of the AAAI Conference on Artificial Intelligence, 40(40):33863--33871
2026
-
[44]
Zheng Weihua, Roy Ka-Wei Lee, Zhengyuan Liu, Wu Kui, AiTi Aw, and Bowei Zou. 2025. CCL-XCoT: An efficient cross-lingual knowledge transfer method for mitigating hallucination generation. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 1768--1788, Suzhou, China. Association for Computational Linguistics
2025
-
[45]
Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. 2025. Qwen2.5-Omni technical report. arXiv preprint arXiv:2503.20215
2025
-
[46]
Yaoxun Xu, Hangting Chen, Jianwei Yu, Qiaochu Huang, Zhiyong Wu, Shi-Xiong Zhang, Guangzhi Li, Yi Luo, and Rongzhi Gu. 2024. Secap: Speech emotion captioning with large language model. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19323--19331
2024
-
[49]
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations
2022
-
[52]
Wenyuan Zhang, Tianyun Liu, Mengxiao Song, Xiaodong Li, and Tingwen Liu. 2025a. SOTOPIA-Ω: Dynamic strategy injection learning and social instruction following evaluation for social agents. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 24669--24697, Vienna, Austria. Association for Computational Linguistics
2025
-
[53]
Wenyuan Zhang, Shuaiyi Nie, Jiawei Sheng, Zefeng Zhang, Xinghua Zhang, Yongquan He, and Tingwen Liu. 2025b. Revealing and mitigating the challenge of detecting character knowledge errors in LLM role-playing. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 33267--33290
2025
-
[54]
Xinglang Zhang, Yunyao Zhang, ZeLiang Chen, Junqing Yu, Wei Yang, and Zikai Song. 2026a. Logical phase transitions: Understanding collapse in llm logical reasoning. arXiv preprint arXiv:2601.02902
2026
-
[55]
Yunyao Zhang, Zikai Song, Hang Zhou, Wenfeng Ren, Yi-Ping Phoebe Chen, Junqing Yu, and Wei Yang. 2025c. GA-S^3: Comprehensive social network simulation with group agents. In Findings of the Association for Computational Linguistics: ACL 2025, pages 8950--8970, Vienna, Austria. Association for Computational Linguistics. https://doi.org/10.18653/v1/2025.findings-acl.468
-
[56]
Zhongxing Zhang, Emily K. Vraga, Jisu Huh, and Jaideep Srivastava. 2026b. BiMind: A dual-head reasoning model with attention-geometry adapter for incorrect information detection. Preprint, arXiv:2604.06022
2026