Recognition: no theorem link
MICA: Multi-granularity Intertemporal Credit Assignment for Long-Horizon Emotional Support Dialogue
Pith reviewed 2026-05-15 15:20 UTC · model grok-4.3
The pith
For multi-turn emotional support dialogues, MICA derives both immediate and delayed credit from a shared potential function over the user's structured support state.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MICA derives both immediate and delayed credit from a shared potential function over the user's structured support state. Incremental Distance Reward measures the per-turn decrease in residual distance to the target state, while its Monte Carlo return captures delayed effects. After scope-specific normalization, the two signals form a mixed advantage for stable per-turn optimization without matched-state comparisons, rollout trees, or a learned critic.
What carries the argument
Shared potential function on the user's structured support state, which produces the Incremental Distance Reward for immediate credit and its Monte Carlo return for delayed credit, mixed after normalization into an advantage signal.
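The machinery above can be sketched end to end. This is a minimal reconstruction from the summary, not the paper's code: the discount `gamma`, the mixing weight `alpha`, and standardization as the "scope-specific normalization" are all assumptions.

```python
import statistics

def mixed_advantages(distances, gamma=1.0, alpha=0.5):
    """Sketch of MICA-style credit assignment (hypothetical reconstruction).

    distances: residual distance to the target support state after each turn,
               with distances[0] the distance before the first response.
    """
    # Incremental Distance Reward: per-turn decrease in residual distance.
    idr = [distances[t] - distances[t + 1] for t in range(len(distances) - 1)]

    # Monte Carlo return of the IDR captures delayed effects of each turn.
    returns, g = [], 0.0
    for r in reversed(idr):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()

    def normalize(xs):
        # Scope-specific normalization: standardize each signal separately.
        mu, sd = statistics.fmean(xs), statistics.pstdev(xs)
        return [(x - mu) / (sd + 1e-8) for x in xs]

    # Mixed advantage: convex combination of the two normalized signals.
    idr_n, ret_n = normalize(idr), normalize(returns)
    return [alpha * a + (1 - alpha) * b for a, b in zip(idr_n, ret_n)]
```

Note that with `gamma = 1` the Monte Carlo return at turn t telescopes to `distances[t] - distances[-1]`, so the delayed signal is itself a difference of potentials, which is what makes a single shared potential function able to carry both granularities.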
If this is right
- Supports stable optimization in multi-turn RL tasks lacking matched states or external supervision.
- Outperforms GRPO and REINFORCE++ on the EMPA, EQ-Bench, and EmoBench benchmarks, with up to a +43.2 gain on EMPA.
- Requires no additional rollout cost and maintains robustness to variations in reward judges.
- Extends RL applicability to interactive LLM scenarios with long-horizon dialogues.
Where Pith is reading between the lines
- This approach might apply to other dialogue domains where progress toward a goal state can be quantified without direct comparisons.
- By avoiding a learned critic, it could reduce variance and training instability in fine-tuning large models for conversations.
- If the state representation generalizes, similar potential functions could simplify credit assignment in other sequential decision tasks.
Load-bearing premise
That a structured support state can be defined such that the residual distance to the target state provides a reliable and comparable signal for assigning per-turn credit without matched-state comparisons or external supervision.
What would settle it
Evidence that no consistent structured support state exists across dialogues, or that the distance-based rewards do not correlate with human judgments of support quality, would invalidate the credit-assignment mechanism.
Original abstract
Reinforcement learning (RL) for large language models (LLMs) has shown strong performance in single-turn tasks, but extending it to multi-turn interaction remains challenging due to sparse rewards and poor per-turn credit assignment. In emotional support dialogues, responses shape future user states, so matched-state step-wise comparison is unavailable, while trajectory-level supervision is insufficient. We propose MICA (Multi-granularity Intertemporal Credit Assignment), a critic-free RL framework for multi-turn emotional support tasks. MICA derives both immediate and delayed credit from a shared potential function over the user's structured support state. Incremental Distance Reward measures the per-turn decrease in residual distance to the target state, while its Monte Carlo return captures delayed effects. After scope-specific normalization, the two signals form a mixed advantage for stable per-turn optimization without matched-state comparisons, rollout trees, or a learned critic. On EMPA, EQ-Bench, and EmoBench with Qwen2.5-7B-Instruct and Qwen3-8B/14B/32B, MICA consistently outperforms GRPO and REINFORCE++, achieving up to +43.2 on EMPA, while adding no rollout cost and remaining robust to reward judges. These results show that turn-aware credit assignment enables effective and practical multi-turn RL for interactive LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MICA, a critic-free RL framework for multi-turn emotional support dialogues. It defines a potential function over a structured user support state to compute an Incremental Distance Reward (per-turn decrease in residual distance to target) for immediate credit and its Monte Carlo return for delayed credit; after scope-specific normalization these form a mixed advantage enabling stable per-turn optimization without matched-state comparisons, rollout trees, or a learned critic. Experiments on EMPA, EQ-Bench, and EmoBench with Qwen2.5-7B-Instruct and Qwen3-8B/14B/32B models report consistent outperformance over GRPO and REINFORCE++, with gains up to +43.2 on EMPA and no added rollout cost.
Significance. If the structured support state and distance metric can be shown to be objectively definable and free of hidden supervision, MICA would provide an efficient, low-overhead solution to intertemporal credit assignment in long-horizon LLM interactions. The reported gains across model scales and robustness to reward judges indicate practical utility for emotional support systems, but the significance depends on validating that the residual-distance signal is comparable across dialogues without external annotations or classifiers.
Major comments (3)
- §3 (Method): The construction of the 'structured support state' and the residual distance metric are not specified (e.g., whether via embeddings, rule-based features, or pre-trained classifiers). This is load-bearing for the central claim that the Incremental Distance Reward supplies an objective, supervision-free per-turn signal; without the definition, the 'no external supervision' guarantee cannot be evaluated.
- §4.2 (Normalization): The scope-specific normalization factors are free parameters whose values are not derived or ablated. Because the mixed advantage is formed only after these factors are applied, the reported stability and outperformance may be sensitive to their choice rather than following directly from the potential function.
- Table 2 / §5.1: Performance numbers are given without error bars, standard deviations, or statistical tests across the three benchmarks and multiple model sizes. This weakens the claim of consistent outperformance, especially when the central derivation relies on the comparability of the distance signal.
Minor comments (1)
- Abstract and §1: The phrase 'adding no rollout cost' is unclear without a direct comparison of total wall-clock time or token compute against the baselines that do use rollouts.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify key aspects of MICA. We address each major point below and will revise the manuscript accordingly to improve clarity and rigor.
Point-by-point responses
- Referee: §3 (Method): The construction of the 'structured support state' and the residual distance metric are not specified (e.g., whether via embeddings, rule-based features, or pre-trained classifiers). This is load-bearing for the central claim that Incremental Distance Reward supplies an objective, supervision-free per-turn signal; without the definition the 'no external supervision' guarantee cannot be evaluated.
Authors: We agree the construction details were insufficiently explicit in §3. The structured support state is defined as a fixed-dimensional vector of rule-based features (e.g., counts of emotional valence indicators, support-need categories, and dialogue-turn progress markers extracted via deterministic heuristics from the user utterance and history). The residual distance is the L1 norm between the current state vector and a target state vector representing complete emotional support resolution. No embeddings, pre-trained classifiers, or external annotations are used. In the revision we will add this formal definition, pseudocode, and an example computation to §3 to make the supervision-free property verifiable. revision: yes
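The construction described in this response can be sketched as follows. The feature sets, target state, and heuristics are hypothetical stand-ins; the review gives only the general shape (deterministic rule-based features, L1 residual distance to a resolved state).

```python
# Hypothetical feature lexicons -- the paper's actual heuristics are not
# given in this review.
NEGATIVE = {"sad", "anxious", "hopeless", "angry"}
POSITIVE = {"calmer", "relieved", "hopeful", "better"}

def support_state(user_utterances):
    """Fixed-dimensional, rule-based state: deterministic counts only."""
    words = " ".join(user_utterances).lower().split()
    return [
        sum(w in NEGATIVE for w in words),  # negative-valence indicators
        sum(w in POSITIVE for w in words),  # positive-valence indicators
        len(user_utterances),               # dialogue-turn progress marker
    ]

def residual_distance(state, target):
    """L1 norm between the current state and the target 'resolved' state."""
    return sum(abs(s - t) for s, t in zip(state, target))

def incremental_distance_reward(prev_state, curr_state, target):
    # Reward is the per-turn decrease in residual distance to the target.
    return residual_distance(prev_state, target) - residual_distance(curr_state, target)
```

Because every feature is a deterministic count, the signal requires no embeddings, classifiers, or annotations, which is the supervision-free property the rebuttal claims.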
- Referee: §4.2 (Normalization): The scope-specific normalization factors are free parameters whose values are not derived or ablated. Because the mixed advantage is formed only after these factors are applied, the reported stability and outperformance may be sensitive to their choice rather than following directly from the potential function.
Authors: The normalization factors are set to the inverse of the expected range of the distance metric within each scope (immediate vs. Monte Carlo) so that the two advantage components have unit variance before mixing. We will revise §4.2 to derive these factors explicitly from the potential-function bounds and add an ablation table showing performance across a range of scaling values, confirming that outperformance holds for any reasonable choice within the derived interval. revision: yes
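The range-based derivation promised here can be illustrated in a few lines. The bounds and the gamma/horizon handling are assumptions for the sketch, not the paper's formulas: each factor is simply the inverse of the signal's possible range in its scope.

```python
def scope_factors(d_max, horizon, gamma=1.0):
    """Hypothetical range-based normalization: one factor per scope,
    each the inverse of that signal's range, assuming the residual
    distance lies in [0, d_max].
    """
    # Immediate scope: a one-step IDR is a difference of two distances,
    # so it lies in [-d_max, d_max].
    idr_range = 2.0 * d_max
    # Delayed scope: with gamma = 1 the return telescopes to a single
    # potential difference, so its range matches the immediate one;
    # otherwise bound it by the discounted geometric sum.
    if gamma == 1.0:
        ret_range = 2.0 * d_max
    else:
        ret_range = 2.0 * d_max * (1 - gamma ** horizon) / (1 - gamma)
    return 1.0 / idr_range, 1.0 / ret_range

def mixed_advantage(idr, ret, f_idr, f_ret, alpha=0.5):
    # Mix only after each signal is scaled to a comparable range.
    return alpha * f_idr * idr + (1 - alpha) * f_ret * ret
```

Under these assumptions the two scaled components live on the same interval before mixing, which is the stability property the rebuttal attributes to the factors.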
- Referee: Table 2 / §5.1: Performance numbers are given without error bars, standard deviations, or statistical tests across the three benchmarks and multiple model sizes. This weakens the claim of consistent outperformance, especially when the central derivation relies on the comparability of the distance signal.
Authors: We acknowledge that error bars and statistical tests are missing. In the revision we will rerun all experiments with three random seeds, report mean ± standard deviation in Table 2, and add paired t-test p-values for MICA vs. baselines in §5.1. This will directly address concerns about the reliability of the distance-signal comparability across dialogues. revision: yes
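The promised aggregation (mean ± standard deviation over seeds, plus a paired t statistic) needs nothing beyond the standard library. The seed scores in the test are invented for illustration.

```python
import math
import statistics

def summarize(scores):
    """Mean and sample standard deviation over random seeds."""
    return statistics.fmean(scores), statistics.stdev(scores)

def paired_t(a, b):
    """Paired t statistic for per-seed score pairs (df = n - 1).

    Positive values mean the first system scored higher on average.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    return statistics.fmean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
```

With only three seeds the degrees of freedom are small, so the t statistic should be read against the t distribution with n - 1 = 2 degrees of freedom, not the normal.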
Circularity Check
No significant circularity detected in MICA derivation
Full rationale
The paper introduces Incremental Distance Reward as the per-turn decrease in residual distance to a target state drawn from a shared potential function, then combines it with its Monte Carlo return after scope-specific normalization to form a mixed advantage. This is a direct definitional construction for the credit signal rather than a reduction of an independent prediction or first-principles result back to fitted inputs. No equations or claims in the provided text show the central result being equivalent to its own inputs by construction, no self-citations are used to justify uniqueness or load-bearing premises, and no ansatz is smuggled or known result renamed. The framework is presented as a design for critic-free multi-turn RL with empirical results on EMPA, EQ-Bench, and EmoBench, remaining self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Free parameters (1)
- scope-specific normalization factors
Axioms (1)
- Domain assumption: A structured support state exists that admits a well-defined residual distance to a target state usable for credit assignment.
Invented entities (1)
- Incremental Distance Reward (no independent evidence)
Reference graph
Works this paper leans on
- [1] Yiqun Zhang, Xiaocui Yang, Xingle Xu, Zeran Gao, Yijie Huang, Shiyi Mu, Shi Feng, Daling Wang, Yifei Zhang, Kaisong Song, and Ge Yu. Affective computing in the era of large language models: A survey from the NLP perspective. arXiv:2408.04638, 2024.
- [2]
- [3] Tingting Liu, Salvatore Giorgi, Ankit Aich, Allison Lahnala, Brenda Curtis, Lyle Ungar, and João Sedoc. The illusion of empathy: how AI chatbots shape conversation perception. In Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence and Thirty-Seventh Conference on Innovative Applications of Artificial Intelligence and Fifteenth Sympos...
- [4] Navonil Majumder, Pengfei Hong, Shanshan Peng, Jiankun Lu, Deepanway Ghosal, Alexander Gelbukh, Rada Mihalcea, and Soujanya Poria. MIME: MIMicking emotions for empathetic response generation. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.emnlp-main.721. URL https://aclanthology.org/2020.emnlp-main.721/
- [6] Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y-Lan Boureau. Towards empathetic open-domain conversation models: A new benchmark and dataset. In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5370–5381, Florence, Italy, July 2019. Association for...
- [7] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Xiaodong Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv:2103.03874, 2021.
- [8] Karl Cobbe, Vineet Kosaraju, Mo Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv:2110.14168, 2021.
- [9] Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. A survey on large language models for code generation. ACM Trans. Softw. Eng. Methodol., 35(2), January 2026. doi: 10.1145/3747588.
- [10] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online,...
- [11] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun'ichi Tsujii, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processin...
- [12] Ting Yang, Li Chen, and Huimin Wang. Towards open-ended emotional support conversations in LLMs via reinforcement learning with future-oriented rewards, 2025. arXiv:2508.12935.
- [13] Jinfeng Zhou, Zhuang Chen, Bo Wang, and Minlie Huang. Facilitating multi-turn emotional support conversation with positive emotion elicitation: A reinforcement learning approach. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),...
- [14] Yirong Chen, Xiaofen Xing, Jingkai Lin, Huimin Zheng, Zhenyu Wang, Qi Liu, and Xiangmin Xu. SoulChat: Improving LLMs' empathy, listening, and comfort abilities through fine-tuning with multi-turn empathy conversations. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023. Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.findings-emnlp.83. URL https://aclanthology.org/2023.findings-emnlp.83/
- [16] Zhonghua Zheng, Lizi Liao, Yang Deng, Libo Qin, and Liqiang Nie. Self-chats from large language models make small emotional support chatbot better. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11325–11345, Bangkok, Thailand,...
- [17] Ting Yang, Li Chen, and Huimin Wang. Towards open-ended emotional support conversations in LLMs via reinforcement learning with future-oriented rewards. arXiv:2508.12935, 2025.
- [18] Peisong Wang, Ruotian Ma, Bang Zhang, Xingyu Chen, Zhiwei He, Kang Luo, Qingsong Lv, Qingxuan Jiang, Zheng Xie, Shanyi Wang, Yuan Li, Fanghua Ye, Jian Li, Yifan Yang, Zhaopeng Tu, and Xiaolong Li. RLVER: Reinforcement learning with verifiable emotion rewards for empathetic agents, 2025. arXiv:2507.03112.
- [19] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models, 2024. arXiv:2402.03300.
- [21] Tao Wang, Suhang Zheng, and Xiaoxiao Xu. RTMC: Step-level credit assignment via rollout trees, 2026.
- [22] Shiya Zhang, Yuhan Zhan, Ruixi Su, Ruihan Sun, Ziyi Song, Zhaohan Chen, and Xiaofan Zhang. EMPA: Evaluating persona-aligned empathy as a process, 2026. arXiv:2603.00552.
- [23] X. Lu, H. M. Schwartz, and S. N. Givigi. Policy invariance under reward transformations for general-sum stochastic games. Journal of Artificial Intelligence Research, 41:397–406, 2011. doi: 10.1613/jair.3384.
- [24]
- [25] Sahand Sabour, Siyang Liu, Zheyuan Zhang, June M. Liu, Jinfeng Zhou, Alvionna S. Sunaryo, Juanzi Li, Tatia M. C. Lee, Rada Mihalcea, and Minlie Huang. EmoBench: Evaluating the emotional intelligence of large language models, 2024. arXiv:2402.12071.
- [26] Qwen: An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,... arXiv, 2025.
- [27] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang,... arXiv, 2025.
- [28] Siyang Liu, Chujie Zheng, Orianna Demasi, Sahand Sabour, Yu Li, Zhou Yu, Yong Jiang, and Minlie Huang. Towards emotional support dialog systems. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural L...
- [29] Zehui Wu, Ziwei Gong, Lin Ai, Pengyuan Shi, Kaan Donbekci, and Julia Hirschberg. Beyond silent letters: Amplifying LLMs in emotion recognition with vocal nuances. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Findings of the Association for Computational Linguistics: NAACL 2025, pages 2202–2218, Albuquerque, New Mexico, April 2025. Association for C...
- [30] Yumeng Fu, Junjie Wu, Zhongjie Wang, Meishan Zhang, Lili Shan, Yulin Wu, and Bingquan Liu. LaERC-S: Improving LLM-based emotion recognition in conversation with speaker characteristics. In International Conference on Computational Linguistics, 2024.
- [31] Ashish Sharma, Adam Miner, David Atkins, and Tim Althoff. A computational approach to understanding empathy expressed in text-based mental health support. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5263–5276, Online, November 2020. A...
- [32] Chujie Zheng, Sahand Sabour, Jiaxin Wen, Zheng Zhang, and Minlie Huang. AugESC: Dialogue augmentation with large language models for emotional support conversation. In Annual Meeting of the Association for Computational Linguistics, 2022.
- [33] Huachuan Qiu, Hongliang He, Shuai Zhang, Anqi Li, and Zhenzhong Lan. SMILE: Single-turn to multi-turn inclusive language expansion via ChatGPT for mental health support. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, pages 615–636, Miami, Florida, USA, November 2024. Ass...
- [34] Wei Peng, Yue Hu, Luxi Xing, Yuqiang Xie, Yajing Sun, and Yunpeng Li. Control globally, understand locally: A global-to-local hierarchical graph network for emotional support conversation. In International Joint Conference on Artificial Intelligence, 2022.
- [35] Xinhao Chen, Chong Yang, Man Lan, Li Cai, Yang Chen, Tu Hu, Xinlin Zhuang, and Aimin Zhou. Cause-aware empathetic response generation via chain-of-thought fine-tuning. arXiv:2408.11599. URL https://api.semanticscholar.org/CorpusID:271916313
- [37] Weixiang Zhao, Xingyu Sui, Xinyang Han, Yang Deng, Yulin Hu, Jiahe Guo, Libo Qin, Qianyun Du, Shijin Wang, Yanyan Zhao, Bing Qin, and Ting Liu. Chain of strategy optimization makes large language models better emotional supporter, 2025. arXiv:2503.05362.
- [38] Jiahao Yuan, Zhiqing Cui, Hanqing Wang, Yuansheng Gao, Yucheng Zhou, and Usman Naseem. Kardia-R1: Unleashing LLMs to reason toward understanding and empathy for emotional support via rubric-as-judge reinforcement learning. In Proceedings of the ACM Web Conference 2026 (WWW '26), pages 9230–9240, New York, NY, USA, 2026. Association for Computing Machinery.
- [39] Mingxiu Cai, Daling Wang, Shi Feng, and Yifei Zhang. EmpCRL: Controllable empathetic response generation via in-context commonsense reasoning and reinforcement learning. In Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue, editors, Proceedings of the 2024 Joint International Conference on Computational Li...
- [40] Yushan Qian, Weinan Zhang, and Ting Liu. Harnessing the power of large language models for empathetic response generation: Empirical investigations and improvements. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 6516–6528, Singapore, December 2023.
- [41] Naifan Zhang, Ruihan Sun, Ruixi Su, Shiqi Ma, Shiya Zhang, Xianna Weng, Xiaofan Zhang, Yuhan Zhan, Yuyang Xu, Zhaohan Chen, Zhengyuan Pan, and Ziyi Song. Echo-N1: Affective RL frontier, 2025. arXiv:2512.00344.
- [42] Bang Zhang, Ruotian Ma, Qingxuan Jiang, Peisong Wang, Jiaqi Chen, Zheng Xie, Xingyu Chen, Yue Wang, Fanghua Ye, Jian Li, Yifan Yang, Zhaopeng Tu, and Xiaolong Li. Sentient agent as a judge: Evaluating higher-order social cognition in large language models, 2025. arXiv:2505.02847.
- [43] Junde Wu, Jiayuan Zhu, Yuyuan Liu, Min Xu, and Yueming Jin. Agentic reasoning: A streamlined framework for enhancing LLM reasoning with agentic tools, 2025. arXiv:2502.04644.
- [44] Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, and Wanjun Zhong. ReTool: Reinforcement learning for strategic tool use in LLMs, 2025. arXiv:2504.11536.
- [45] Zhepei Wei, Wenlin Yao, Yao Liu, Weizhi Zhang, Qin Lu, Liang Qiu, Changlong Yu, Puyang Xu, Chao Zhang, Bing Yin, Hyokun Yun, and Lihong Li. WebAgent-R1: Training web agents via end-to-end multi-turn reinforcement learning, 2025. arXiv:2505.16421.
- [46] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn., 8(3–4):229–256, May 1992. doi: 10.1007/BF00992696.
- [47] Jian Hu, Jason Klein Liu, Haotian Xu, and Wei Shen. REINFORCE++: Stabilizing critic-free policy optimization with global advantage normalization, 2025. arXiv:2501.03262.
- [48] Wouter Kool, Herke van Hoof, and Max Welling. Buy 4 REINFORCE samples, get a baseline for free! URL https://openreview.net/forum?id=r1lgTGL5DE
- [50] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017. arXiv:1707.06347.
- [51] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu,... DAPO: An open-source LLM reinforcement learning system at scale. arXiv, 2025.
- [52] Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin. Group sequence policy optimization, 2025. arXiv:2507.18071.
- [53] MiniMax: Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, Chengjun Xiao, Chengyu Du, Chi Zhang, Chu Qiao, Chunhao Zhang, Chunhui Du, Congchao Guo, Da Chen, Deming Ding, Dianjun Sun, Dong Li, Enwei Jiao, Haigang Zhou, Haimo Zhang, Han Ding, Haohai Sun, Haoyu Feng, Huaiguang Cai, Haichao Z... MiniMax-M1: Scaling test-time compute efficiently with lightning attention. arXiv, 2025.
- [54] Mingyang Chen, Linzhuang Sun, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu, Haofen Wang, Jeff Z. Pan, Wen Zhang, Huajun Chen, Fan Yang, Zenan Zhou, and Weipeng Chen. ReSearch: Learning to reason with search for LLMs via reinforcement learning, 2025. arXiv:2503.19470.
- [55] Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-R1: Training LLMs to reason and leverage search engines with reinforcement learning, 2025. arXiv:2503.09516.
- [56] Tianyi Hu, Qingxu Fu, Yanxi Chen, Zhaoyang Liu, and Bolin Ding. Seeupo: Sequence-level agentic RL with convergence guarantees, 2026. arXiv:2602.06554.
- [57] Quan Wei, Siliang Zeng, Chenliang Li, William Brown, Oana Frunza, Wei Deng, Anderson Schneider, Yuriy Nevmyvaka, Yang Katie Zhao, Alfredo Garcia, and Mingyi Hong. Reinforcing multi-turn reasoning in LLM agents via turn-level reward design, 2025. arXiv:2505.11821.
- [58] Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for LLM agent training, 2025. arXiv:2505.10978.
- [59] Hongli Yu, Ting Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, and Hao Zhou. MemAgent: Reshaping long-context LLM with multi-conv RL-based memory agent. arXiv:2507.02259, 2025.
- [60] Yaorui Shi, Yuxin Chen, Siyuan Wang, Sihang Li, Hengxing Cai, Qi Gu, Xiang Wang, and An Zhang. Look back to reason forward: Revisitable memory for long-context LLM agents. arXiv:2509.23040. URL https://api.semanticscholar.org/CorpusID:281676451
- [62] Hieu Tran, Zonghai Yao, and Hong Yu. Exploiting tree structure for credit assignment in RL training of LLMs. arXiv:2509.18314, 2025.
- [63] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, and others. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. arXiv:2507.06261.
- [64] DeepSeek-AI, Aixin Liu, Aoxue Mei, Bangcai Lin, and others. DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025.
Appendix material extracted along with the references:
- Appendix A (Open Source Multi-Turn Dialogue RL Framework): "As part of the resources accompanying this work, we introduce verl-MICA (github link), a highly scalable reinforcement learning (RL)..."
- Appendix B (covariance bound and mixed-advantage variance): Let X and Y be random variables with Var(X) = Var(Y) = 1. Then Cov(X, Y) ∈ [−1, 1]. Proof: let ρ denote the Pearson correlation coefficient between X and Y, ρ = Cov(X, Y) / (σ(X)σ(Y)) (8). By the Cauchy–Schwarz inequality, |ρ| ≤ 1. Since σ(X) = σ(Y) = 1, it follows that Cov(X, Y) ∈ [−1, 1]. Proposition 1: Let X and Y be two random variables with Var(X) = Var(Y) = 1. For any convex combination Z = α... (When c = 1, Var(Z) is constant in α.)
- Appendix C (Experiment Details, C.1 Benchmarks): EMPA contains 30 private test cases, with Gemini-2.5-pro [58] as the judge. The model being tested has up to 45 turns to calm down a simulated user (also played by Gemini-2.5-pro) and address their emotional needs. If the model causes the user's emotional state to regress for 5 consecut...
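The appendix variance claim can be checked numerically. Assuming Var(X) = Var(Y) = 1, the variance of the convex combination Z = αX + (1−α)Y expands to α² + (1−α)² + 2α(1−α)·Cov(X, Y); the expansion itself is standard, but treating it as the appendix's Var(Z) is an inference from the fragment above.

```python
def var_convex(alpha, cov_xy):
    """Var(alpha*X + (1-alpha)*Y) for unit-variance X, Y with Cov(X, Y) = cov_xy.

    Standard bilinearity of covariance:
    Var(Z) = a^2 Var(X) + (1-a)^2 Var(Y) + 2 a (1-a) Cov(X, Y).
    """
    return alpha**2 + (1 - alpha)**2 + 2 * alpha * (1 - alpha) * cov_xy
```

When the covariance equals 1 the expression collapses to (α + (1−α))² = 1 for every α, which is exactly the "Var(Z) is constant in α" remark quoted above; for any covariance below 1, mixing strictly reduces variance relative to either component alone.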