Recognition: no theorem link
MICA: Multi-granularity Intertemporal Credit Assignment for Long-Horizon Emotional Support Dialogue
Pith reviewed 2026-05-15 15:20 UTC · model grok-4.3
The pith
For multi-turn emotional support dialogues, MICA derives both immediate and delayed credit from a shared potential function over the user's structured support state.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MICA derives both immediate and delayed credit from a shared potential function over the user's structured support state. Incremental Distance Reward measures the per-turn decrease in residual distance to the target state, while its Monte Carlo return captures delayed effects. After scope-specific normalization, the two signals form a mixed advantage for stable per-turn optimization without matched-state comparisons, rollout trees, or a learned critic.
What carries the argument
Shared potential function on the user's structured support state, which produces the Incremental Distance Reward for immediate credit and its Monte Carlo return for delayed credit, mixed after normalization into an advantage signal.
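The machinery above can be sketched end to end. This is a minimal reconstruction from the summary, not the paper's code: the discount `gamma`, the mixing weight `alpha`, and standardization as the "scope-specific normalization" are all assumptions.

```python
import statistics

def mixed_advantages(distances, gamma=1.0, alpha=0.5):
    """Sketch of MICA-style credit assignment (hypothetical reconstruction).

    distances: residual distance to the target support state after each turn,
               with distances[0] the distance before the first response.
    """
    # Incremental Distance Reward: per-turn decrease in residual distance.
    idr = [distances[t] - distances[t + 1] for t in range(len(distances) - 1)]

    # Monte Carlo return of the IDR captures delayed effects of each turn.
    returns, g = [], 0.0
    for r in reversed(idr):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()

    def normalize(xs):
        # Scope-specific normalization: standardize each signal separately.
        mu, sd = statistics.fmean(xs), statistics.pstdev(xs)
        return [(x - mu) / (sd + 1e-8) for x in xs]

    # Mixed advantage: convex combination of the two normalized signals.
    idr_n, ret_n = normalize(idr), normalize(returns)
    return [alpha * a + (1 - alpha) * b for a, b in zip(idr_n, ret_n)]
```

Note that with `gamma = 1` the Monte Carlo return at turn t telescopes to `distances[t] - distances[-1]`, so the delayed signal is itself a difference of potentials, which is what makes a single shared potential function able to carry both granularities.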
If this is right
- Supports stable optimization in multi-turn RL tasks lacking matched states or external supervision.
- Outperforms GRPO and REINFORCE++ on the EMPA, EQ-Bench, and EmoBench benchmarks, with up to a +43.2 gain on EMPA.
- Requires no additional rollout cost and maintains robustness to variations in reward judges.
- Extends RL applicability to interactive LLM scenarios with long-horizon dialogues.
Where Pith is reading between the lines
- This approach might apply to other dialogue domains where progress toward a goal state can be quantified without direct comparisons.
- By avoiding a learned critic, it could reduce variance and training instability in fine-tuning large models for conversations.
- If the state representation generalizes, similar potential functions could simplify credit assignment in other sequential decision tasks.
Load-bearing premise
That a structured support state can be defined such that the residual distance to the target state provides a reliable and comparable signal for assigning per-turn credit without matched-state comparisons or external supervision.
What would settle it
Evidence that no consistent structured support state exists across dialogues, or that the distance-based rewards do not correlate with human judgments of support quality, would invalidate the credit-assignment mechanism.
Original abstract
Reinforcement learning (RL) for large language models (LLMs) has shown strong performance in single-turn tasks, but extending it to multi-turn interaction remains challenging due to sparse rewards and poor per-turn credit assignment. In emotional support dialogues, responses shape future user states, so matched-state step-wise comparison is unavailable, while trajectory-level supervision is insufficient. We propose MICA (Multi-granularity Intertemporal Credit Assignment), a critic-free RL framework for multi-turn emotional support tasks. MICA derives both immediate and delayed credit from a shared potential function over the user's structured support state. Incremental Distance Reward measures the per-turn decrease in residual distance to the target state, while its Monte Carlo return captures delayed effects. After scope-specific normalization, the two signals form a mixed advantage for stable per-turn optimization without matched-state comparisons, rollout trees, or a learned critic. On EMPA, EQ-Bench, and EmoBench with Qwen2.5-7B-Instruct and Qwen3-8B/14B/32B, MICA consistently outperforms GRPO and REINFORCE++, achieving up to +43.2 on EMPA, while adding no rollout cost and remaining robust to reward judges. These results show that turn-aware credit assignment enables effective and practical multi-turn RL for interactive LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MICA, a critic-free RL framework for multi-turn emotional support dialogues. It defines a potential function over a structured user support state to compute an Incremental Distance Reward (per-turn decrease in residual distance to target) for immediate credit and its Monte Carlo return for delayed credit; after scope-specific normalization these form a mixed advantage enabling stable per-turn optimization without matched-state comparisons, rollout trees, or a learned critic. Experiments on EMPA, EQ-Bench, and EmoBench with Qwen2.5-7B-Instruct and Qwen3-8B/14B/32B models report consistent outperformance over GRPO and REINFORCE++, with gains up to +43.2 on EMPA and no added rollout cost.
Significance. If the structured support state and distance metric can be shown to be objectively definable and free of hidden supervision, MICA would provide an efficient, low-overhead solution to intertemporal credit assignment in long-horizon LLM interactions. The reported gains across model scales and robustness to reward judges indicate practical utility for emotional support systems, but the significance depends on validating that the residual-distance signal is comparable across dialogues without external annotations or classifiers.
Major comments (3)
- §3 (Method): The construction of the 'structured support state' and the residual distance metric are not specified (e.g., whether via embeddings, rule-based features, or pre-trained classifiers). This is load-bearing for the central claim that the Incremental Distance Reward supplies an objective, supervision-free per-turn signal; without the definition, the 'no external supervision' guarantee cannot be evaluated.
- §4.2 (Normalization): The scope-specific normalization factors are free parameters whose values are not derived or ablated. Because the mixed advantage is formed only after these factors are applied, the reported stability and outperformance may be sensitive to their choice rather than following directly from the potential function.
- Table 2 / §5.1: Performance numbers are given without error bars, standard deviations, or statistical tests across the three benchmarks and multiple model sizes. This weakens the claim of consistent outperformance, especially when the central derivation relies on the comparability of the distance signal.
Minor comments (1)
- Abstract and §1: The phrase 'adding no rollout cost' is unclear without a direct comparison of total wall-clock time or token compute against the baselines that do use rollouts.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify key aspects of MICA. We address each major point below and will revise the manuscript accordingly to improve clarity and rigor.
Point-by-point responses
- Referee: §3 (Method): The construction of the 'structured support state' and the residual distance metric are not specified (e.g., whether via embeddings, rule-based features, or pre-trained classifiers). This is load-bearing for the central claim that Incremental Distance Reward supplies an objective, supervision-free per-turn signal; without the definition the 'no external supervision' guarantee cannot be evaluated.
Authors: We agree the construction details were insufficiently explicit in §3. The structured support state is defined as a fixed-dimensional vector of rule-based features (e.g., counts of emotional valence indicators, support-need categories, and dialogue-turn progress markers extracted via deterministic heuristics from the user utterance and history). The residual distance is the L1 norm between the current state vector and a target state vector representing complete emotional support resolution. No embeddings, pre-trained classifiers, or external annotations are used. In the revision we will add this formal definition, pseudocode, and an example computation to §3 to make the supervision-free property verifiable. revision: yes
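The construction described in this response can be sketched as follows. The feature sets, target state, and heuristics are hypothetical stand-ins; the review gives only the general shape (deterministic rule-based features, L1 residual distance to a resolved state).

```python
# Hypothetical feature lexicons -- the paper's actual heuristics are not
# given in this review.
NEGATIVE = {"sad", "anxious", "hopeless", "angry"}
POSITIVE = {"calmer", "relieved", "hopeful", "better"}

def support_state(user_utterances):
    """Fixed-dimensional, rule-based state: deterministic counts only."""
    words = " ".join(user_utterances).lower().split()
    return [
        sum(w in NEGATIVE for w in words),  # negative-valence indicators
        sum(w in POSITIVE for w in words),  # positive-valence indicators
        len(user_utterances),               # dialogue-turn progress marker
    ]

def residual_distance(state, target):
    """L1 norm between the current state and the target 'resolved' state."""
    return sum(abs(s - t) for s, t in zip(state, target))

def incremental_distance_reward(prev_state, curr_state, target):
    # Reward is the per-turn decrease in residual distance to the target.
    return residual_distance(prev_state, target) - residual_distance(curr_state, target)
```

Because every feature is a deterministic count, the signal requires no embeddings, classifiers, or annotations, which is the supervision-free property the rebuttal claims.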
- Referee: §4.2 (Normalization): The scope-specific normalization factors are free parameters whose values are not derived or ablated. Because the mixed advantage is formed only after these factors are applied, the reported stability and outperformance may be sensitive to their choice rather than following directly from the potential function.
Authors: The normalization factors are set to the inverse of the expected range of the distance metric within each scope (immediate vs. Monte Carlo) so that the two advantage components have unit variance before mixing. We will revise §4.2 to derive these factors explicitly from the potential-function bounds and add an ablation table showing performance across a range of scaling values, confirming that outperformance holds for any reasonable choice within the derived interval. revision: yes
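The range-based derivation promised here can be illustrated in a few lines. The bounds and the gamma/horizon handling are assumptions for the sketch, not the paper's formulas: each factor is simply the inverse of the signal's possible range in its scope.

```python
def scope_factors(d_max, horizon, gamma=1.0):
    """Hypothetical range-based normalization: one factor per scope,
    each the inverse of that signal's range, assuming the residual
    distance lies in [0, d_max].
    """
    # Immediate scope: a one-step IDR is a difference of two distances,
    # so it lies in [-d_max, d_max].
    idr_range = 2.0 * d_max
    # Delayed scope: with gamma = 1 the return telescopes to a single
    # potential difference, so its range matches the immediate one;
    # otherwise bound it by the discounted geometric sum.
    if gamma == 1.0:
        ret_range = 2.0 * d_max
    else:
        ret_range = 2.0 * d_max * (1 - gamma ** horizon) / (1 - gamma)
    return 1.0 / idr_range, 1.0 / ret_range

def mixed_advantage(idr, ret, f_idr, f_ret, alpha=0.5):
    # Mix only after each signal is scaled to a comparable range.
    return alpha * f_idr * idr + (1 - alpha) * f_ret * ret
```

Under these assumptions the two scaled components live on the same interval before mixing, which is the stability property the rebuttal attributes to the factors.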
- Referee: Table 2 / §5.1: Performance numbers are given without error bars, standard deviations, or statistical tests across the three benchmarks and multiple model sizes. This weakens the claim of consistent outperformance, especially when the central derivation relies on the comparability of the distance signal.
Authors: We acknowledge that error bars and statistical tests are missing. In the revision we will rerun all experiments with three random seeds, report mean ± standard deviation in Table 2, and add paired t-test p-values for MICA vs. baselines in §5.1. This will directly address concerns about the reliability of the distance-signal comparability across dialogues. revision: yes
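The promised aggregation (mean ± standard deviation over seeds, plus a paired t statistic) needs nothing beyond the standard library. The seed scores in the test are invented for illustration.

```python
import math
import statistics

def summarize(scores):
    """Mean and sample standard deviation over random seeds."""
    return statistics.fmean(scores), statistics.stdev(scores)

def paired_t(a, b):
    """Paired t statistic for per-seed score pairs (df = n - 1).

    Positive values mean the first system scored higher on average.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    return statistics.fmean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
```

With only three seeds the degrees of freedom are small, so the t statistic should be read against the t distribution with n - 1 = 2 degrees of freedom, not the normal.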
Circularity Check
No significant circularity detected in MICA derivation
Full rationale
The paper introduces Incremental Distance Reward as the per-turn decrease in residual distance to a target state drawn from a shared potential function, then combines it with its Monte Carlo return after scope-specific normalization to form a mixed advantage. This is a direct definitional construction for the credit signal rather than a reduction of an independent prediction or first-principles result back to fitted inputs. No equations or claims in the provided text show the central result being equivalent to its own inputs by construction, no self-citations are used to justify uniqueness or load-bearing premises, and no ansatz is smuggled or known result renamed. The framework is presented as a design for critic-free multi-turn RL with empirical results on EMPA, EQ-Bench, and EmoBench, remaining self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Free parameters (1)
- scope-specific normalization factors
Axioms (1)
- Domain assumption: A structured support state exists that admits a well-defined residual distance to a target state usable for credit assignment.
Invented entities (1)
- Incremental Distance Reward (no independent evidence)
Reference graph
Works this paper leans on
- [1] Yiqun Zhang, Xiaocui Yang, Xingle Xu, Zeran Gao, Yijie Huang, Shiyi Mu, Shi Feng, Daling Wang, Yifei Zhang, Kaisong Song, and Ge Yu. Affective computing in the era of large language models: A survey from the NLP perspective. arXiv:2408.04638, 2024.
- [2]
- [3] Tingting Liu, Salvatore Giorgi, Ankit Aich, Allison Lahnala, Brenda Curtis, Lyle Ungar, and João Sedoc. The illusion of empathy: how AI chatbots shape conversation perception. In Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence and Thirty-Seventh Conference on Innovative Applications of Artificial Intelligence and Fifteenth Sympos...
- [4] Navonil Majumder, Pengfei Hong, Shanshan Peng, Jiankun Lu, Deepanway Ghosal, Alexander Gelbukh, Rada Mihalcea, and Soujanya Poria. MIME: MIMicking emotions for empathetic response generation. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.emnlp-main.721. URL https://aclanthology.org/2020.emnlp-main.721/
- [6] Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y-Lan Boureau. Towards empathetic open-domain conversation models: A new benchmark and dataset. In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5370–5381, Florence, Italy, July 2019. Association for...
- [7] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Xiaodong Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv:2103.03874, 2021.
- [8] Karl Cobbe, Vineet Kosaraju, Mo Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv:2110.14168, 2021.
- [9] Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. A survey on large language models for code generation. ACM Trans. Softw. Eng. Methodol., 35(2), January 2026. doi: 10.1145/3747588.
- [10] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online,...
- [11] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun'ichi Tsujii, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processin...
- [12] Ting Yang, Li Chen, and Huimin Wang. Towards open-ended emotional support conversations in LLMs via reinforcement learning with future-oriented rewards, 2025. arXiv:2508.12935.
- [13] Jinfeng Zhou, Zhuang Chen, Bo Wang, and Minlie Huang. Facilitating multi-turn emotional support conversation with positive emotion elicitation: A reinforcement learning approach. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),...
- [14] Yirong Chen, Xiaofen Xing, Jingkai Lin, Huimin Zheng, Zhenyu Wang, Qi Liu, and Xiangmin Xu. SoulChat: Improving LLMs' empathy, listening, and comfort abilities through fine-tuning with multi-turn empathy conversations. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023. Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.findings-emnlp.83. URL https://aclanthology.org/2023.findings-emnlp.83/
- [16] Zhonghua Zheng, Lizi Liao, Yang Deng, Libo Qin, and Liqiang Nie. Self-chats from large language models make small emotional support chatbot better. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11325–11345, Bangkok, Thailand,...
- [17] Ting Yang, Li Chen, and Huimin Wang. Towards open-ended emotional support conversations in LLMs via reinforcement learning with future-oriented rewards. arXiv:2508.12935, 2025.
- [18] Peisong Wang, Ruotian Ma, Bang Zhang, Xingyu Chen, Zhiwei He, Kang Luo, Qingsong Lv, Qingxuan Jiang, Zheng Xie, Shanyi Wang, Yuan Li, Fanghua Ye, Jian Li, Yifan Yang, Zhaopeng Tu, and Xiaolong Li. RLVER: Reinforcement learning with verifiable emotion rewards for empathetic agents, 2025. arXiv:2507.03112.
- [19] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models, 2024. arXiv:2402.03300.
- [21] Tao Wang, Suhang Zheng, and Xiaoxiao Xu. RTMC: Step-level credit assignment via rollout trees, 2026.
- [22] Shiya Zhang, Yuhan Zhan, Ruixi Su, Ruihan Sun, Ziyi Song, Zhaohan Chen, and Xiaofan Zhang. EMPA: Evaluating persona-aligned empathy as a process, 2026. arXiv:2603.00552.
- [23] X. Lu, H. M. Schwartz, and S. N. Givigi. Policy invariance under reward transformations for general-sum stochastic games. Journal of Artificial Intelligence Research, 41:397–406, 2011. doi: 10.1613/jair.3384.
- [24]
- [25] Sahand Sabour, Siyang Liu, Zheyuan Zhang, June M. Liu, Jinfeng Zhou, Alvionna S. Sunaryo, Juanzi Li, Tatia M. C. Lee, Rada Mihalcea, and Minlie Huang. EmoBench: Evaluating the emotional intelligence of large language models, 2024. arXiv:2402.12071.
- [26] Qwen: An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,... arXiv, 2025.
- [27] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang,... arXiv, 2025.
- [28] Siyang Liu, Chujie Zheng, Orianna Demasi, Sahand Sabour, Yu Li, Zhou Yu, Yong Jiang, and Minlie Huang. Towards emotional support dialog systems. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural L...
- [29] Zehui Wu, Ziwei Gong, Lin Ai, Pengyuan Shi, Kaan Donbekci, and Julia Hirschberg. Beyond silent letters: Amplifying LLMs in emotion recognition with vocal nuances. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Findings of the Association for Computational Linguistics: NAACL 2025, pages 2202–2218, Albuquerque, New Mexico, April 2025. Association for C...
- [30] Yumeng Fu, Junjie Wu, Zhongjie Wang, Meishan Zhang, Lili Shan, Yulin Wu, and Bingquan Liu. LaERC-S: Improving LLM-based emotion recognition in conversation with speaker characteristics. In International Conference on Computational Linguistics, 2024.
- [31] Ashish Sharma, Adam Miner, David Atkins, and Tim Althoff. A computational approach to understanding empathy expressed in text-based mental health support. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5263–5276, Online, November 2020. A...
- [32] Chujie Zheng, Sahand Sabour, Jiaxin Wen, Zheng Zhang, and Minlie Huang. AugESC: Dialogue augmentation with large language models for emotional support conversation. In Annual Meeting of the Association for Computational Linguistics, 2022.
- [33] Huachuan Qiu, Hongliang He, Shuai Zhang, Anqi Li, and Zhenzhong Lan. SMILE: Single-turn to multi-turn inclusive language expansion via ChatGPT for mental health support. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, pages 615–636, Miami, Florida, USA, November 2024. Ass...
- [34] Wei Peng, Yue Hu, Luxi Xing, Yuqiang Xie, Yajing Sun, and Yunpeng Li. Control globally, understand locally: A global-to-local hierarchical graph network for emotional support conversation. In International Joint Conference on Artificial Intelligence, 2022.
- [35] Xinhao Chen, Chong Yang, Man Lan, Li Cai, Yang Chen, Tu Hu, Xinlin Zhuang, and Aimin Zhou. Cause-aware empathetic response generation via chain-of-thought fine-tuning. arXiv:2408.11599. URL https://api.semanticscholar.org/CorpusID:271916313
- [37] Weixiang Zhao, Xingyu Sui, Xinyang Han, Yang Deng, Yulin Hu, Jiahe Guo, Libo Qin, Qianyun Du, Shijin Wang, Yanyan Zhao, Bing Qin, and Ting Liu. Chain of strategy optimization makes large language models better emotional supporter, 2025. arXiv:2503.05362.
- [38] Jiahao Yuan, Zhiqing Cui, Hanqing Wang, Yuansheng Gao, Yucheng Zhou, and Usman Naseem. Kardia-R1: Unleashing LLMs to reason toward understanding and empathy for emotional support via rubric-as-judge reinforcement learning. In Proceedings of the ACM Web Conference 2026 (WWW '26), pages 9230–9240, New York, NY, USA, 2026. Association for Computing Machinery.
- [39] Mingxiu Cai, Daling Wang, Shi Feng, and Yifei Zhang. EmpCRL: Controllable empathetic response generation via in-context commonsense reasoning and reinforcement learning. In Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue, editors, Proceedings of the 2024 Joint International Conference on Computational Li...
- [40] Yushan Qian, Weinan Zhang, and Ting Liu. Harnessing the power of large language models for empathetic response generation: Empirical investigations and improvements. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 6516–6528, Singapore, December 2023.
- [41] Naifan Zhang, Ruihan Sun, Ruixi Su, Shiqi Ma, Shiya Zhang, Xianna Weng, Xiaofan Zhang, Yuhan Zhan, Yuyang Xu, Zhaohan Chen, Zhengyuan Pan, and Ziyi Song. Echo-N1: Affective RL frontier, 2025. arXiv:2512.00344.
- [42] Bang Zhang, Ruotian Ma, Qingxuan Jiang, Peisong Wang, Jiaqi Chen, Zheng Xie, Xingyu Chen, Yue Wang, Fanghua Ye, Jian Li, Yifan Yang, Zhaopeng Tu, and Xiaolong Li. Sentient agent as a judge: Evaluating higher-order social cognition in large language models, 2025. arXiv:2505.02847.
- [43] Junde Wu, Jiayuan Zhu, Yuyuan Liu, Min Xu, and Yueming Jin. Agentic reasoning: A streamlined framework for enhancing LLM reasoning with agentic tools, 2025. arXiv:2502.04644.
- [44] Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, and Wanjun Zhong. ReTool: Reinforcement learning for strategic tool use in LLMs, 2025. arXiv:2504.11536.
- [45] Zhepei Wei, Wenlin Yao, Yao Liu, Weizhi Zhang, Qin Lu, Liang Qiu, Changlong Yu, Puyang Xu, Chao Zhang, Bing Yin, Hyokun Yun, and Lihong Li. WebAgent-R1: Training web agents via end-to-end multi-turn reinforcement learning, 2025. arXiv:2505.16421.
- [46] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn., 8(3–4):229–256, May 1992. doi: 10.1007/BF00992696.
- [47] Jian Hu, Jason Klein Liu, Haotian Xu, and Wei Shen. REINFORCE++: Stabilizing critic-free policy optimization with global advantage normalization, 2025. arXiv:2501.03262.
- [48] Wouter Kool, Herke van Hoof, and Max Welling. Buy 4 REINFORCE samples, get a baseline for free! URL https://openreview.net/forum?id=r1lgTGL5DE
- [50] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017. arXiv:1707.06347.
- [51] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu,... DAPO: An open-source LLM reinforcement learning system at scale. arXiv, 2025.
- [52] Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin. Group sequence policy optimization, 2025. arXiv:2507.18071.
- [53] MiniMax: Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, Chengjun Xiao, Chengyu Du, Chi Zhang, Chu Qiao, Chunhao Zhang, Chunhui Du, Congchao Guo, Da Chen, Deming Ding, Dianjun Sun, Dong Li, Enwei Jiao, Haigang Zhou, Haimo Zhang, Han Ding, Haohai Sun, Haoyu Feng, Huaiguang Cai, Haichao Z... MiniMax-M1: Scaling test-time compute efficiently with lightning attention. arXiv, 2025.
- [54] Mingyang Chen, Linzhuang Sun, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu, Haofen Wang, Jeff Z. Pan, Wen Zhang, Huajun Chen, Fan Yang, Zenan Zhou, and Weipeng Chen. ReSearch: Learning to reason with search for LLMs via reinforcement learning, 2025. arXiv:2503.19470.
- [55] Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-R1: Training LLMs to reason and leverage search engines with reinforcement learning, 2025. arXiv:2503.09516.
- [56] Tianyi Hu, Qingxu Fu, Yanxi Chen, Zhaoyang Liu, and Bolin Ding. Seeupo: Sequence-level agentic RL with convergence guarantees, 2026. arXiv:2602.06554.
- [57] Quan Wei, Siliang Zeng, Chenliang Li, William Brown, Oana Frunza, Wei Deng, Anderson Schneider, Yuriy Nevmyvaka, Yang Katie Zhao, Alfredo Garcia, and Mingyi Hong. Reinforcing multi-turn reasoning in LLM agents via turn-level reward design, 2025. arXiv:2505.11821.
- [58] Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for LLM agent training, 2025. arXiv:2505.10978.
- [59] Hongli Yu, Ting Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, and Hao Zhou. MemAgent: Reshaping long-context LLM with multi-conv RL-based memory agent. arXiv:2507.02259, 2025.
- [60] Yaorui Shi, Yuxin Chen, Siyuan Wang, Sihang Li, Hengxing Cai, Qi Gu, Xiang Wang, and An Zhang. Look back to reason forward: Revisitable memory for long-context LLM agents. arXiv:2509.23040. URL https://api.semanticscholar.org/CorpusID:281676451
- [62] Hieu Tran, Zonghai Yao, and Hong Yu. Exploiting tree structure for credit assignment in RL training of LLMs. arXiv:2509.18314, 2025.
- [63] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, and others. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. arXiv:2507.06261.
- [64] DeepSeek-AI, Aixin Liu, Aoxue Mei, Bangcai Lin, and others. DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025.
Appendix material extracted along with the references:
- Appendix A (Open Source Multi-Turn Dialogue RL Framework): "As part of the resources accompanying this work, we introduce verl-MICA (github link), a highly scalable reinforcement learning (RL)..."
- Appendix B (covariance bound and mixed-advantage variance): Let X and Y be random variables with Var(X) = Var(Y) = 1. Then Cov(X, Y) ∈ [−1, 1]. Proof: let ρ denote the Pearson correlation coefficient between X and Y, ρ = Cov(X, Y) / (σ(X)σ(Y)) (8). By the Cauchy–Schwarz inequality, |ρ| ≤ 1. Since σ(X) = σ(Y) = 1, it follows that Cov(X, Y) ∈ [−1, 1]. Proposition 1: Let X and Y be two random variables with Var(X) = Var(Y) = 1. For any convex combination Z = α... (When c = 1, Var(Z) is constant in α.)
- Appendix C (Experiment Details, C.1 Benchmarks): EMPA contains 30 private test cases, with Gemini-2.5-pro [58] as the judge. The model being tested has up to 45 turns to calm down a simulated user (also played by Gemini-2.5-pro) and address their emotional needs. If the model causes the user's emotional state to regress for 5 consecut...
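The appendix variance claim can be checked numerically. Assuming Var(X) = Var(Y) = 1, the variance of the convex combination Z = αX + (1−α)Y expands to α² + (1−α)² + 2α(1−α)·Cov(X, Y); the expansion itself is standard, but treating it as the appendix's Var(Z) is an inference from the fragment above.

```python
def var_convex(alpha, cov_xy):
    """Var(alpha*X + (1-alpha)*Y) for unit-variance X, Y with Cov(X, Y) = cov_xy.

    Standard bilinearity of covariance:
    Var(Z) = a^2 Var(X) + (1-a)^2 Var(Y) + 2 a (1-a) Cov(X, Y).
    """
    return alpha**2 + (1 - alpha)**2 + 2 * alpha * (1 - alpha) * cov_xy
```

When the covariance equals 1 the expression collapses to (α + (1−α))² = 1 for every α, which is exactly the "Var(Z) is constant in α" remark quoted above; for any covariance below 1, mixing strictly reduces variance relative to either component alone.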