pith. machine review for the scientific record.

arxiv: 2603.06194 · v2 · submitted 2026-03-06 · 💻 cs.CL · cs.AI

Recognition: no theorem link

MICA: Multi-granularity Intertemporal Credit Assignment for Long-Horizon Emotional Support Dialogue

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 15:20 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords multi-turn RL · credit assignment · emotional support dialogue · potential function · LLM · incremental distance reward · reinforcement learning for dialogue

The pith

MICA derives both immediate and delayed credit from a shared potential function over the user's structured support state for multi-turn emotional support dialogues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MICA, a critic-free reinforcement learning method for long-horizon emotional support conversations with large language models. It addresses sparse rewards and poor per-turn credit assignment by using a potential function over the user's structured support state to generate both immediate and delayed signals. Immediate credit comes from the per-turn reduction in distance to the target state, while delayed credit uses the Monte Carlo return of that same distance measure. The two signals are normalized and combined into a mixed advantage for stable per-turn updates without matched-state comparisons or rollout trees. Experiments on three benchmarks show consistent gains over GRPO and REINFORCE++ while adding no rollout cost.
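A minimal sketch of that two-signal construction, assuming the user's support state is already summarized as a residual distance to the target after each turn; the function names and discount handling are illustrative, not the paper's code.

```python
import numpy as np

def incremental_distance_rewards(distances):
    """Immediate credit: per-turn decrease in residual distance to the
    target support state. `distances` holds d_0 ... d_T, with d_0 the
    distance before the first assistant turn."""
    d = np.asarray(distances, dtype=float)
    return d[:-1] - d[1:]                 # r_t = d_{t-1} - d_t

def monte_carlo_returns(rewards, gamma=1.0):
    """Delayed credit: Monte Carlo return of the same distance signal."""
    returns = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Toy trajectory: residual distance shrinks from 5.0 to 1.0 over four turns.
d = [5.0, 4.0, 3.5, 2.0, 1.0]
r = incremental_distance_rewards(d)       # [1.0, 0.5, 1.5, 1.0]
G = monte_carlo_returns(r)                # [4.0, 3.0, 2.5, 1.0]; telescopes to d_0 - d_T at t = 0
```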

Core claim

MICA derives both immediate and delayed credit from a shared potential function over the user's structured support state. Incremental Distance Reward measures the per-turn decrease in residual distance to the target state, while its Monte Carlo return captures delayed effects. After scope-specific normalization, the two signals form a mixed advantage for stable per-turn optimization without matched-state comparisons, rollout trees, or a learned critic.
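Written out, the claim amounts to the following; the turn-wise normalization mirrors the equation fragment visible near Figure 3 and the convex combination follows the Figure 6 caption, so the notation is reconstructed rather than quoted from the paper.

```latex
% Immediate credit: Incremental Distance Reward; delayed credit: its Monte Carlo return
r_t^{(i)} \;=\; d\bigl(s_{t-1}^{(i)}, s^{\star}\bigr) \;-\; d\bigl(s_{t}^{(i)}, s^{\star}\bigr),
\qquad
R_t^{(i)} \;=\; \sum_{k=t}^{T} r_k^{(i)}

% Scope-specific (turn-wise) normalization over the samples I_t that reach turn t
A_t^{(i)} \;=\; \frac{R_t^{(i)} - \mu_t}{\sigma_t},
\qquad
\mu_t = \frac{1}{N_t}\sum_{i \in I_t} R_t^{(i)},
\qquad
\sigma_t = \sqrt{\frac{1}{N_t}\sum_{i \in I_t}\bigl(R_t^{(i)} - \mu_t\bigr)^2}

% Mixed advantage: convex combination of turn-level and group-level terms
A^{(i)}_{t,\mathrm{mixed}} \;=\; \alpha\, A^{(i)}_{t,\mathrm{turn}} \;+\; (1-\alpha)\, A^{(i)}_{t,\mathrm{group}}
```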

What carries the argument

Shared potential function on the user's structured support state, which produces the Incremental Distance Reward for immediate credit and its Monte Carlo return for delayed credit, mixed after normalization into an advantage signal.

If this is right

  • Supports stable optimization in multi-turn RL tasks lacking matched states or external supervision.
  • Outperforms GRPO and REINFORCE++ on the EMPA, EQ-Bench, and EmoBench benchmarks, with gains of up to +43.2 on EMPA.
  • Requires no additional rollout cost and maintains robustness to variations in reward judges.
  • Extends RL applicability to interactive LLM scenarios with long-horizon dialogues.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach might apply to other dialogue domains where progress toward a goal state can be quantified without direct comparisons.
  • By avoiding a learned critic, it could reduce variance and training instability in fine-tuning large models for conversations.
  • If the state representation generalizes, similar potential functions could simplify credit assignment in other sequential decision tasks.

Load-bearing premise

That a structured support state can be defined such that the residual distance to the target state provides a reliable and comparable signal for assigning per-turn credit without matched-state comparisons or external supervision.

What would settle it

Showing either that no consistent structured support state can be defined across dialogues, or that the distance-based rewards fail to correlate with human judgments of support quality, would invalidate the credit-assignment mechanism.

Figures

Figures reproduced from arXiv: 2603.06194 by Hengjie Yang, Jinwei Su, Naifan Zhang, Ruihan Sun, Xiaofan Zhang, Zhaohan Chen, Zhengyuan Pan.

Figure 1
Figure 1: Framework of MICA. The policy model interacts with the Actor to collect multi-turn trajectories, which are then optimized via the Mixed Advantage. view at source ↗
Figure 2
Figure 2: MICA overall framework. Given an initial prompt, we sample K trajectories from the current policy, each consisting of T turns. The turn-level advantage is computed by normalizing returns across samples at the same turn. The group-level advantage is computed by normalizing rewards over all K × T samples in the group. The final advantage is a convex combination of these two terms, balancing fine-grained credit… view at source ↗
Figure 3
Figure 3: Distribution of Monte Carlo returns and immediate rewards across dialogue turns at a specific training step. (a) Monte Carlo returns exhibit a clear positive correlation with the turn index; (b) in contrast, immediate rewards show no discernible trend across turns. view at source ↗
Figure 4
Figure 4: Empathy alignment scores across various dimensions. MICA consistently outperforms GRPO and REINFORCE++ variants across all dimensions and model scales, showing greater alignment gains over the base model. view at source ↗
Figure 5
Figure 5: Reward and gradient-norm curves of Qwen3-8B and Qwen2.5-7B-Instruct under various advantage formulations. Mixed Advantage achieves the highest converged reward while maintaining stable gradient norms, demonstrating simultaneous improvements in both reward performance and training stability. view at source ↗
Figure 6
Figure 6: Effect of the turn-level advantage weight α on dialogue strategy and benchmark performance. We sweep α ∈ {0.0, 0.1, …, 1.0} in the mixed advantage A = αAt + (1 − α)Ab on Qwen2.5-7B-Instruct and run 5 independent trials per configuration. Error bars denote one standard deviation across the 5 trials; small dots denote per-trial values. Increasing α correlates with longer, more stable EMPA dialogues… view at source ↗
Figure 7
Figure 7: Reward trajectories during MICA training for four base models (Qwen2.5-7B-Instruct, Qwen3-8B, Qwen3-14B, Qwen3-32B) with the actor fixed to Qwen3-235B. Each panel corresponds to one base model; within each panel, the Judger varies across Qwen3-235B, MiniMax-M2.5, and GLM-4.7, while the MICA training recipe is otherwise identical across all 12 runs. The x-axis denotes the cumulative number… view at source ↗
Figure 8
Figure 8: Cosine similarity of Judger score directions on fixed EMPA traces. Gemini-2.5-pro generates dialogue trajectories for the 30 EMPA test cases, and the same response turns are rescored by Qwen3-235B, MiniMax-M2.5, and GLM-4.7. Each entry reports the averaged cosine similarity between per-response score-change vectors (Δxt, Δyt, Δzt). view at source ↗
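The Figure 2 caption describes the two normalization scopes concretely enough to sketch. A minimal reconstruction, assuming per-turn returns arranged as a K × T matrix; the names and the α default are illustrative, not the authors' implementation.

```python
import numpy as np

def mixed_advantage(returns, alpha=0.5, eps=1e-8):
    """Sketch of the mixed advantage from the Figure 2 caption.
    `returns` is a K x T matrix: K sampled trajectories, T turns each.
    Turn-level: normalize across the K samples at the same turn index.
    Group-level: normalize over all K*T samples in the group."""
    R = np.asarray(returns, dtype=float)

    # Turn-level advantage: column-wise (per-turn) standardization.
    mu_t = R.mean(axis=0, keepdims=True)
    sigma_t = R.std(axis=0, keepdims=True)
    A_turn = (R - mu_t) / (sigma_t + eps)

    # Group-level advantage: standardization over the whole K x T group.
    A_group = (R - R.mean()) / (R.std() + eps)

    # Convex combination balancing fine-grained and trajectory-level credit.
    return alpha * A_turn + (1.0 - alpha) * A_group

K, T = 4, 6
adv = mixed_advantage(np.random.randn(K, T), alpha=0.3)   # shape (4, 6)
```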
read the original abstract

Reinforcement learning (RL) for large language models (LLMs) has shown strong performance in single-turn tasks, but extending it to multi-turn interaction remains challenging due to sparse rewards and poor per-turn credit assignment. In emotional support dialogues, responses shape future user states, so matched-state step-wise comparison is unavailable, while trajectory-level supervision is insufficient. We propose MICA (Multi-granularity Intertemporal Credit Assignment), a critic-free RL framework for multi-turn emotional support tasks. MICA derives both immediate and delayed credit from a shared potential function over the user's structured support state. Incremental Distance Reward measures the per-turn decrease in residual distance to the target state, while its Monte Carlo return captures delayed effects. After scope-specific normalization, the two signals form a mixed advantage for stable per-turn optimization without matched-state comparisons, rollout trees, or a learned critic. On EMPA, EQ-Bench, and EmoBench with Qwen2.5-7B-Instruct and Qwen3-8B/14B/32B, MICA consistently outperforms GRPO and REINFORCE++, achieving up to +43.2 on EMPA, while adding no rollout cost and remaining robust to reward judges. These results show that turn-aware credit assignment enables effective and practical multi-turn RL for interactive LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes MICA, a critic-free RL framework for multi-turn emotional support dialogues. It defines a potential function over a structured user support state to compute an Incremental Distance Reward (per-turn decrease in residual distance to target) for immediate credit and its Monte Carlo return for delayed credit; after scope-specific normalization these form a mixed advantage enabling stable per-turn optimization without matched-state comparisons, rollout trees, or a learned critic. Experiments on EMPA, EQ-Bench, and EmoBench with Qwen2.5-7B-Instruct and Qwen3-8B/14B/32B models report consistent outperformance over GRPO and REINFORCE++, with gains up to +43.2 on EMPA and no added rollout cost.

Significance. If the structured support state and distance metric can be shown to be objectively definable and free of hidden supervision, MICA would provide an efficient, low-overhead solution to intertemporal credit assignment in long-horizon LLM interactions. The reported gains across model scales and robustness to reward judges indicate practical utility for emotional support systems, but the significance depends on validating that the residual-distance signal is comparable across dialogues without external annotations or classifiers.

major comments (3)
  1. [§3] §3 (Method): The construction of the 'structured support state' and the residual distance metric are not specified (e.g., whether via embeddings, rule-based features, or pre-trained classifiers). This is load-bearing for the central claim that Incremental Distance Reward supplies an objective, supervision-free per-turn signal; without the definition the 'no external supervision' guarantee cannot be evaluated.
  2. [§4.2] §4.2 (Normalization): The scope-specific normalization factors are free parameters whose values are not derived or ablated. Because the mixed advantage is formed only after these factors are applied, the reported stability and outperformance may be sensitive to their choice rather than following directly from the potential function.
  3. [Table 2] Table 2 / §5.1: Performance numbers are given without error bars, standard deviations, or statistical tests across the three benchmarks and multiple model sizes. This weakens the claim of consistent outperformance, especially when the central derivation relies on the comparability of the distance signal.
minor comments (1)
  1. [Abstract] Abstract and §1: The phrase 'adding no rollout cost' is unclear without a direct comparison of total wall-clock or token compute against the baselines that do use rollouts.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify key aspects of MICA. We address each major point below and will revise the manuscript accordingly to improve clarity and rigor.

read point-by-point responses
  1. Referee: [§3] §3 (Method): The construction of the 'structured support state' and the residual distance metric are not specified (e.g., whether via embeddings, rule-based features, or pre-trained classifiers). This is load-bearing for the central claim that Incremental Distance Reward supplies an objective, supervision-free per-turn signal; without the definition the 'no external supervision' guarantee cannot be evaluated.

    Authors: We agree the construction details were insufficiently explicit in §3. The structured support state is defined as a fixed-dimensional vector of rule-based features (e.g., counts of emotional valence indicators, support-need categories, and dialogue-turn progress markers extracted via deterministic heuristics from the user utterance and history). The residual distance is the L1 norm between the current state vector and a target state vector representing complete emotional support resolution. No embeddings, pre-trained classifiers, or external annotations are used. In the revision we will add this formal definition, pseudocode, and an example computation to §3 to make the supervision-free property verifiable. revision: yes

  2. Referee: [§4.2] §4.2 (Normalization): The scope-specific normalization factors are free parameters whose values are not derived or ablated. Because the mixed advantage is formed only after these factors are applied, the reported stability and outperformance may be sensitive to their choice rather than following directly from the potential function.

    Authors: The normalization factors are set to the inverse of the expected range of the distance metric within each scope (immediate vs. Monte Carlo) so that the two advantage components have unit variance before mixing. We will revise §4.2 to derive these factors explicitly from the potential-function bounds and add an ablation table showing performance across a range of scaling values, confirming that outperformance holds for any reasonable choice within the derived interval. revision: yes

  3. Referee: [Table 2] Table 2 / §5.1: Performance numbers are given without error bars, standard deviations, or statistical tests across the three benchmarks and multiple model sizes. This weakens the claim of consistent outperformance, especially when the central derivation relies on the comparability of the distance signal.

    Authors: We acknowledge that error bars and statistical tests are missing. In the revision we will rerun all experiments with three random seeds, report mean ± standard deviation in Table 2, and add paired t-test p-values for MICA vs. baselines in §5.1. This will directly address concerns about the reliability of the distance-signal comparability across dialogues. revision: yes
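To make the construction described in the (simulated) response to point 1 concrete, here is a toy sketch of a rule-based support state with an L1 residual distance; the feature names are hypothetical stand-ins, not taken from the paper.

```python
import numpy as np

# Hypothetical rule-based features; the paper's actual extractors may differ.
FEATURES = ["negative_valence", "unmet_support_needs", "unresolved_topics"]
TARGET_STATE = np.zeros(len(FEATURES))    # "fully supported" resolution

def support_state(turn_features):
    """Map rule-based counts for one user turn to a fixed-dimensional vector."""
    return np.array([turn_features[name] for name in FEATURES], dtype=float)

def residual_distance(state, target=TARGET_STATE):
    """L1 distance to the target state, as described in the simulated rebuttal."""
    return np.abs(state - target).sum()

before = support_state({"negative_valence": 3, "unmet_support_needs": 2, "unresolved_topics": 1})
after  = support_state({"negative_valence": 2, "unmet_support_needs": 1, "unresolved_topics": 1})
idr = residual_distance(before) - residual_distance(after)   # 6.0 - 4.0 = 2.0
```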
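For point 3, the promised seed-level comparison could be checked with a standard paired t-test, e.g. via scipy; the scores below are placeholders, not results from the paper.

```python
from scipy import stats

# Placeholder per-seed benchmark scores; NOT numbers from the paper.
mica_scores     = [71.2, 69.8, 70.5]
baseline_scores = [64.1, 65.0, 63.7]

t_stat, p_value = stats.ttest_rel(mica_scores, baseline_scores)
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.4f}")
```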

Circularity Check

0 steps flagged

No significant circularity detected in MICA derivation

full rationale

The paper introduces Incremental Distance Reward as the per-turn decrease in residual distance to a target state drawn from a shared potential function, then combines it with its Monte Carlo return after scope-specific normalization to form a mixed advantage. This is a direct definitional construction for the credit signal rather than a reduction of an independent prediction or first-principles result back to fitted inputs. No equations or claims in the provided text show the central result being equivalent to its own inputs by construction, no self-citations are used to justify uniqueness or load-bearing premises, and no ansatz is smuggled or known result renamed. The framework is presented as a design for critic-free multi-turn RL with empirical results on EMPA, EQ-Bench, and EmoBench, remaining self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The framework rests on the existence of a measurable structured support state and a potential function whose distance metric yields usable per-turn signals; normalization factors are introduced without independent justification in the abstract.

free parameters (1)
  • scope-specific normalization factors
    Used to combine immediate distance reward and Monte Carlo return into the mixed advantage; values are not derived from first principles.
axioms (1)
  • domain assumption: A structured support state exists that admits a well-defined residual distance to a target state usable for credit assignment.
    Invoked to define the potential function and Incremental Distance Reward.
invented entities (1)
  • Incremental Distance Reward (no independent evidence)
    purpose: Provide immediate per-turn credit by measuring decrease in residual distance to target state.
    Newly defined component of the mixed advantage.

pith-pipeline@v0.9.0 · 5555 in / 1442 out tokens · 54696 ms · 2026-05-15T15:20:51.021638+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 16 internal anchors

  1. [1]

    Affective computing in the era of large language models: A survey from the NLP perspective

    Yiqun Zhang, Xiaocui Yang, Xingle Xu, Zeran Gao, Yijie Huang, Shiyi Mu, Shi Feng, Daling Wang, Yifei Zhang, Kaisong Song, and Ge Yu. Affective computing in the era of large language models: A survey from the NLP perspective. ArXiv, abs/2408.04638, 2024. URL https://api.semanticscholar.org/CorpusID:271843516

  2. [2]

    Empathy and the right to be an exception: What LLMs can and cannot do

    William Kidder, Jason D’Cruz, and Kush R. Varshney. Empathy and the right to be an exception: What llms can and cannot do.ArXiv, abs/2401.14523, 2024. URL https://api.semanticscholar. org/CorpusID:267301044

  3. [3]

    The illusion of empathy: how ai chatbots shape conversation perception

    Tingting Liu, Salvatore Giorgi, Ankit Aich, Allison Lahnala, Brenda Curtis, Lyle Ungar, and João Sedoc. The illusion of empathy: how ai chatbots shape conversation perception. InProceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence and Thirty-Seventh Conference on Innovative Applications of Artificial Intelligence and Fifteenth Sympos...

  4. [4]

    MIME: MIMicking emotions for empathetic response generation

    Navonil Majumder, Pengfei Hong, Shanshan Peng, Jiankun Lu, Deepanway Ghosal, Alexander Gelbukh, Rada Mihalcea, and Soujanya Poria. MIME: MIMicking emotions for empathetic response generation. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors,Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages ...

  5. [5]

    doi: 10.18653/v1/2020.emnlp-main.721

    Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.721. URL https: //aclanthology.org/2020.emnlp-main.721/

  6. [6]

    Towards Empathetic Open-domain Conversation Models: A New Benchmark and Dataset , url =

    Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y-Lan Boureau. Towards empathetic open- domain conversation models: A new benchmark and dataset. In Anna Korhonen, David Traum, and Lluís Màrquez, editors,Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5370–5381, Florence, Italy, July 2019. Association for...

  7. [7]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Xiaodong Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset.ArXiv, abs/2103.03874, 2021. URLhttps://api.semanticscholar.org/CorpusID:232134851

  8. [8]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mo Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.ArXiv, abs/2110.14168, 2021. URL https://api.semanticscholar. org/CorpusID:239998651

  9. [9]

    A survey on large language models for code generation.ACM Trans

    Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. A survey on large language models for code generation.ACM Trans. Softw. Eng. Methodol., 35(2), January 2026. ISSN 1049-331X. doi: 10.1145/3747588. URLhttps://doi.org/10.1145/3747588

  10. [10]

    Dense passage retrieval for open-domain question answering

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors,Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online,...

  11. [11]

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii, editors,Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processin...

  12. [12]

    Towards open-ended emotional support conversations in llms via reinforcement learning with future-oriented rewards, 2025

    Ting Yang, Li Chen, and Huimin Wang. Towards open-ended emotional support conversations in llms via reinforcement learning with future-oriented rewards, 2025. URLhttps://arxiv.org/abs/2508. 12935

  13. [13]

    Facilitating multi-turn emotional support conversation with positive emotion elicitation: A reinforcement learning approach

    Jinfeng Zhou, Zhuang Chen, Bo Wang, and Minlie Huang. Facilitating multi-turn emotional support conversation with positive emotion elicitation: A reinforcement learning approach. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors,Proceedings of the 61st Annual Meeting of the Asso- ciation for Computational Linguistics (Volume 1: Long Papers),...

  14. [14]

    SoulChat: Improving LLMs’ empathy, listening, and comfort abilities through fine-tuning with multi- turn empathy conversations

    Yirong Chen, Xiaofen Xing, Jingkai Lin, Huimin Zheng, Zhenyu Wang, Qi Liu, and Xiangmin Xu. SoulChat: Improving LLMs’ empathy, listening, and comfort abilities through fine-tuning with multi- turn empathy conversations. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Findings of the Association for Computational Linguistics: EMNLP 2023, pages 1170–1...

  15. [15]

    doi: 10.18653/v1/2023.findings-emnlp.83

    Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.83. URL https://aclanthology.org/2023.findings-emnlp.83/

  16. [16]

    Self-chats from large language models make small emotional support chatbot better

    Zhonghua Zheng, Lizi Liao, Yang Deng, Libo Qin, and Liqiang Nie. Self-chats from large language models make small emotional support chatbot better. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11325–11345, Bangkok, Thailand,...

  17. [17]

    Towards open-ended emotional support conversations in llms via reinforcement learning with future-oriented rewards.ArXiv, abs/2508.12935, 2025

    Ting Yang, Li Chen, and Huimin Wang. Towards open-ended emotional support conversations in llms via reinforcement learning with future-oriented rewards.ArXiv, abs/2508.12935, 2025. URL https: //api.semanticscholar.org/CorpusID:280677049

  18. [18]

    Rlver: Reinforcement learning with verifiable emotion rewards for empathetic agents, 2025

    Peisong Wang, Ruotian Ma, Bang Zhang, Xingyu Chen, Zhiwei He, Kang Luo, Qingsong Lv, Qingxuan Jiang, Zheng Xie, Shanyi Wang, Yuan Li, Fanghua Ye, Jian Li, Yifan Yang, Zhaopeng Tu, and Xiaolong Li. Rlver: Reinforcement learning with verifiable emotion rewards for empathetic agents, 2025. URL https://arxiv.org/abs/2507.03112

  19. [19]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y . K. Li, Y . Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URLhttps://arxiv.org/abs/2402.03300

  20. [21]

    Rtmc: Step-level credit assignment via rollout trees

    Tao Wang, Suhang Zheng, and Xiaoxiao Xu. Rtmc: Step-level credit assignment via rollout trees. 2026. URLhttps://api.semanticscholar.org/CorpusID:287432778

  21. [22]

    Empa: Evaluating persona-aligned empathy as a process, 2026

    Shiya Zhang, Yuhan Zhan, Ruixi Su, Ruihan Sun, Ziyi Song, Zhaohan Chen, and Xiaofan Zhang. Empa: Evaluating persona-aligned empathy as a process, 2026. URL https://arxiv.org/abs/2603. 00552

  22. [23]

    X. Lu, H. M. Schwartz, and S. N. Givigi. Policy invariance under reward transformations for general-sum stochastic games.Journal of Artificial Intelligence Research, 41:397–406, 2011. ISSN 1076-9757. doi: 10.1613/jair.3384. URLhttp://dx.doi.org/10.1613/jair.3384

  23. [24]

    Samuel J. Paech. Eq-bench: An emotional intelligence benchmark for large language models, 2024. URL https://arxiv.org/abs/2312.06281

  24. [25]

    EmoBench: Evaluating the emotional intelligence of large language models

    Sahand Sabour, Siyang Liu, Zheyuan Zhang, June M. Liu, Jinfeng Zhou, Alvionna S. Sunaryo, Juanzi Li, Tatia M. C. Lee, Rada Mihalcea, and Minlie Huang. Emobench: Evaluating the emotional intelligence of large language models, 2024. URLhttps://arxiv.org/abs/2402.12071

  25. [26]

    Qwen2.5 Technical Report

    Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

  26. [27]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  27. [28]

    Towards emotional support dialog systems

    Siyang Liu, Chujie Zheng, Orianna Demasi, Sahand Sabour, Yu Li, Zhou Yu, Yong Jiang, and Minlie Huang. Towards emotional support dialog systems. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors,Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural L...

  28. [29]

    Beyond silent letters: Amplifying LLMs in emotion recognition with vocal nuances

    Zehui Wu, Ziwei Gong, Lin Ai, Pengyuan Shi, Kaan Donbekci, and Julia Hirschberg. Beyond silent letters: Amplifying LLMs in emotion recognition with vocal nuances. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors,Findings of the Association for Computational Linguistics: NAACL 2025, pages 2202–2218, Albuquerque, New Mexico, April 2025. Association for C...

  29. [30]

    Laerc-s: Improving llm-based emotion recognition in conversation with speaker characteristics

    Yumeng Fu, Junjie Wu, Zhongjie Wang, Meishan Zhang, Lili Shan, Yulin Wu, and Bingquan Liu. Laerc-s: Improving llm-based emotion recognition in conversation with speaker characteristics. InInternational Conference on Computational Linguistics, 2024. URL https://api.semanticscholar.org/ CorpusID:268363554

  30. [31]

    A computational approach to understanding empathy expressed in text-based mental health support

    Ashish Sharma, Adam Miner, David Atkins, and Tim Althoff. A computational approach to understanding empathy expressed in text-based mental health support. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors,Proceedings of the 2020 Conference on Empirical Methods in Natural Language Process- ing (EMNLP), pages 5263–5276, Online, November 2020. A...

  31. [32]

    Augesc: Dialogue augmenta- tion with large language models for emotional support conversation

    Chujie Zheng, Sahand Sabour, Jiaxin Wen, Zheng Zhang, and Minlie Huang. Augesc: Dialogue augmenta- tion with large language models for emotional support conversation. InAnnual Meeting of the Association for Computational Linguistics, 2022. URL https://api.semanticscholar.org/CorpusID: 258588110

  32. [33]

    SMILE: Single-turn to multi-turn inclusive language expansion via ChatGPT for mental health support

    Huachuan Qiu, Hongliang He, Shuai Zhang, Anqi Li, and Zhenzhong Lan. SMILE: Single-turn to multi-turn inclusive language expansion via ChatGPT for mental health support. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Findings of the Association for Computational Linguistics: EMNLP 2024, pages 615–636, Miami, Florida, USA, November 2024. Ass...

  33. [34]

    Control globally, understand locally: A global-to-local hierarchical graph network for emotional support conversation

    Wei Peng, Yue Hu, Luxi Xing, Yuqiang Xie, Yajing Sun, and Yunpeng Li. Control globally, understand locally: A global-to-local hierarchical graph network for emotional support conversation. InInternational Joint Conference on Artificial Intelligence, 2022. URL https://api.semanticscholar.org/ CorpusID:248406141

  34. [35]

    Cause-aware empathetic response generation via chain-of-thought fine-tuning.ArXiv, abs/2408.11599,

    Xinhao Chen, Chong Yang, Man Lan, Li Cai, Yang Chen, Tu Hu, Xinlin Zhuang, and Aimin Zhou. Cause-aware empathetic response generation via chain-of-thought fine-tuning.ArXiv, abs/2408.11599,

  35. [36]

    URLhttps://api.semanticscholar.org/CorpusID:271916313

  36. [37]

    Chain of strategy optimization makes large language models better emotional supporter, 2025

    Weixiang Zhao, Xingyu Sui, Xinyang Han, Yang Deng, Yulin Hu, Jiahe Guo, Libo Qin, Qianyun Du, Shijin Wang, Yanyan Zhao, Bing Qin, and Ting Liu. Chain of strategy optimization makes large language models better emotional supporter, 2025. URLhttps://arxiv.org/abs/2503.05362

  37. [38]

    Kardia- r1: Unleashing llms to reason toward understanding and empathy for emotional support via rubric-as- judge reinforcement learning

    Jiahao Yuan, Zhiqing Cui, Hanqing Wang, Yuansheng Gao, Yucheng Zhou, and Usman Naseem. Kardia- r1: Unleashing llms to reason toward understanding and empathy for emotional support via rubric-as- judge reinforcement learning. InProceedings of the ACM Web Conference 2026, WWW ’26, page 9230–9240, New York, NY , USA, 2026. Association for Computing Machinery...

  38. [39]

    EmpCRL: Controllable empathetic response generation via in-context commonsense reasoning and reinforcement learning

    Mingxiu Cai, Daling Wang, Shi Feng, and Yifei Zhang. EmpCRL: Controllable empathetic response generation via in-context commonsense reasoning and reinforcement learning. In Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue, editors,Proceedings of the 2024 Joint International Conference on Computational Li...

  39. [40]

    Harnessing the power of large language models for empathetic response generation: Empirical investigations and improvements

    Yushan Qian, Weinan Zhang, and Ting Liu. Harnessing the power of large language models for empathetic response generation: Empirical investigations and improvements. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Findings of the Association for Computational Linguistics: EMNLP 2023, pages 6516– 6528, Singapore, December 2023. Association for Comput...

  40. [41]

    Echo-n1: Affective rl frontier, 2025

    Naifan Zhang, Ruihan Sun, Ruixi Su, Shiqi Ma, Shiya Zhang, Xianna Weng, Xiaofan Zhang, Yuhan Zhan, Yuyang Xu, Zhaohan Chen, Zhengyuan Pan, and Ziyi Song. Echo-n1: Affective rl frontier, 2025. URL https://arxiv.org/abs/2512.00344

  41. [42]

    Sentient agent as a judge: Evaluating higher-order social cognition in large language models, 2025

    Bang Zhang, Ruotian Ma, Qingxuan Jiang, Peisong Wang, Jiaqi Chen, Zheng Xie, Xingyu Chen, Yue Wang, Fanghua Ye, Jian Li, Yifan Yang, Zhaopeng Tu, and Xiaolong Li. Sentient agent as a judge: Evaluating higher-order social cognition in large language models, 2025. URL https://arxiv.org/ abs/2505.02847

  42. [43]

    Agentic reasoning: A streamlined framework for enhancing llm reasoning with agentic tools, 2025

    Junde Wu, Jiayuan Zhu, Yuyuan Liu, Min Xu, and Yueming Jin. Agentic reasoning: A streamlined framework for enhancing llm reasoning with agentic tools, 2025. URL https://arxiv.org/abs/ 2502.04644

  43. [44]

    ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

    Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, and Wanjun Zhong. Retool: Reinforcement learning for strategic tool use in llms, 2025. URL https://arxiv.org/abs/2504.11536

  44. [45]

    Webagent-r1: Training web agents via end-to-end multi-turn rein- forcement learning.arXiv preprint arXiv:2505.16421,

    Zhepei Wei, Wenlin Yao, Yao Liu, Weizhi Zhang, Qin Lu, Liang Qiu, Changlong Yu, Puyang Xu, Chao Zhang, Bing Yin, Hyokun Yun, and Lihong Li. Webagent-r1: Training web agents via end-to-end multi-turn reinforcement learning, 2025. URLhttps://arxiv.org/abs/2505.16421

  45. [46]

    Simple statistical gradient-following algorithms for connectionist reinforcement learning

    Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning.Mach. Learn., 8(3–4):229–256, May 1992. ISSN 0885-6125. doi: 10.1007/BF00992696. URL https://doi.org/10.1007/BF00992696

  46. [47]

    REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

    Jian Hu, Jason Klein Liu, Haotian Xu, and Wei Shen. Reinforce++: Stabilizing critic-free policy optimiza- tion with global advantage normalization, 2025. URLhttps://arxiv.org/abs/2501.03262

  47. [48]

    Buy 4 REINFORCE samples, get a baseline for free!,

    Wouter Kool, Herke van Hoof, and Max Welling. Buy 4 REINFORCE samples, get a baseline for free!,

  48. [49]

    URLhttps://openreview.net/forum?id=r1lgTGL5DE

  49. [50]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017. URLhttps://arxiv.org/abs/1707.06347

  50. [51]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

  51. [52]

    Group Sequence Policy Optimization

    Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin. Group sequence policy optimization, 2025. URL https://arxiv.org/abs/2507.18071

  52. [53]

    MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    MiniMax, :, Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, Chengjun Xiao, Chengyu Du, Chi Zhang, Chu Qiao, Chunhao Zhang, Chunhui Du, Congchao Guo, Da Chen, Deming Ding, Dianjun Sun, Dong Li, Enwei Jiao, Haigang Zhou, Haimo Zhang, Han Ding, Haohai Sun, Haoyu Feng, Huaiguang Cai, Haichao Z...

  53. [54]

    Pan, Wen Zhang, Huajun Chen, Fan Yang, Zenan Zhou, and Weipeng Chen

    Mingyang Chen, Linzhuang Sun, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu, Haofen Wang, Jeff Z. Pan, Wen Zhang, Huajun Chen, Fan Yang, Zenan Zhou, and Weipeng Chen. Research: Learning to reason with search for llms via reinforcement learning, 2025. URL https://arxiv.org/abs/ 2503.19470. 13

  54. [55]

    Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

    Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning, 2025. URLhttps://arxiv.org/abs/2503.09516

  55. [56]

    Seeupo: Sequence-level agentic-rl with convergence guarantees, 2026

    Tianyi Hu, Qingxu Fu, Yanxi Chen, Zhaoyang Liu, and Bolin Ding. Seeupo: Sequence-level agentic-rl with convergence guarantees, 2026. URLhttps://arxiv.org/abs/2602.06554

  56. [57]

    Reinforcing multi-turn reasoning in llm agents via turn-level reward design, 2025

    Quan Wei, Siliang Zeng, Chenliang Li, William Brown, Oana Frunza, Wei Deng, Anderson Schneider, Yuriy Nevmyvaka, Yang Katie Zhao, Alfredo Garcia, and Mingyi Hong. Reinforcing multi-turn reasoning in llm agents via turn-level reward design, 2025. URLhttps://arxiv.org/abs/2505.11821

  57. [58]

    Group-in-Group Policy Optimization for LLM Agent Training

    Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for llm agent training, 2025. URLhttps://arxiv.org/abs/2505.10978

  58. [59]

    MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent

    Hongli Yu, Ting Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, and Hao Zhou. Memagent: Reshaping long-context llm with multi-conv rl-based memory agent.ArXiv, abs/2507.02259, 2025. URL https://api.semanticscholar. org/CorpusID:280047896

  59. [60]

    Look back to reason forward: Revisitable memory for long-context llm agents.ArXiv, abs/2509.23040,

    Yaorui Shi, Yuxin Chen, Siyuan Wang, Sihang Li, Hengxing Cai, Qi Gu, Xiang Wang, and An Zhang. Look back to reason forward: Revisitable memory for long-context llm agents.ArXiv, abs/2509.23040,

  60. [61]

    URLhttps://api.semanticscholar.org/CorpusID:281676451

  61. [62]

    Exploiting tree structure for credit assignment in rl training of llms.ArXiv, abs/2509.18314, 2025

    Hieu Tran, Zonghai Yao, and Hong Yu. Exploiting tree structure for credit assignment in rl training of llms.ArXiv, abs/2509.18314, 2025. URL https://api.semanticscholar.org/CorpusID: 281496178

  62. [63]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, and Others. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. URL https://arxiv.org/abs/2507.06261

  63. [64]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    DeepSeek-AI, Aixin Liu, Aoxue Mei, Bangcai Lin, and Others. Deepseek-v3.2: Pushing the frontier of open large language models.arXiv preprint arXiv:2512.02556, 2025. 14 A Open Source Multi-Turn Dialogue RL Framework: verl-MICA As part of the resources accompanying this work, we introduce verl-MICA (githublink), a highly scalable reinforcement learning (RL)...
