Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning
Pith reviewed 2026-05-20 22:16 UTC · model grok-4.3
The pith
SLIM dynamically manages external skills in agentic RL by retaining high-value ones, retiring low-contribution ones, and adding new ones as needed.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SLIM treats the active external skill set as a dynamic optimization variable jointly updated with policy learning. It estimates each active skill's marginal external contribution through leave-one-skill-out validation, then applies three lifecycle operations: retaining high-value skills, retiring skills whose contribution becomes negligible after sufficient exposure, and expanding the skill bank when persistent failures reveal missing capability coverage.
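One way to write down the leave-one-skill-out estimate described above. The notation is an assumption made here for illustration, not taken from the paper: $S_t$ is the active skill set at training stage $t$, $\pi_t$ the current policy, and $V(\pi, S)$ a validation score obtained by running policy $\pi$ with skill set $S$.

```latex
\Delta_t(s) \;=\; V(\pi_t, S_t) \;-\; V\bigl(\pi_t,\, S_t \setminus \{s\}\bigr),
\qquad s \in S_t
```

Under this reading, a skill is retained while $\Delta_t(s)$ is clearly positive and becomes a retirement candidate once $\Delta_t(s)$ stays negligible after sufficient exposure. How the paper defines "negligible" and "sufficient" is not stated in this summary.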
What carries the argument
Leave-one-skill-out validation to measure marginal contribution of each active skill, which then triggers retain, retire, or expand decisions on the skill set during joint policy training.
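The retain, retire, and expand decisions can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `marginal_contributions`, `lifecycle_step`, and `toy_evaluate`, the `threshold` and `min_exposure` parameters, and the rule of adding one candidate skill on persistent failure are all hypothetical.

```python
from typing import Callable, Dict, FrozenSet, List

# Hypothetical evaluator: maps an active skill set to a validation score.
Evaluator = Callable[[FrozenSet[str]], float]


def marginal_contributions(active: FrozenSet[str], evaluate: Evaluator) -> Dict[str, float]:
    """Leave-one-skill-out: full-set score minus the score without each skill."""
    full_score = evaluate(active)
    return {s: full_score - evaluate(active - {s}) for s in active}


def lifecycle_step(
    active: FrozenSet[str],
    exposure: Dict[str, int],
    evaluate: Evaluator,
    persistent_failure: bool,
    candidates: List[str],
    threshold: float = 0.01,  # assumed negligible-contribution threshold
    min_exposure: int = 3,    # assumed "sufficient exposure", in rounds
) -> FrozenSet[str]:
    """Return the next active skill set after retain / retire / expand."""
    deltas = marginal_contributions(active, evaluate)
    kept = set()
    for skill in active:
        negligible = deltas[skill] <= threshold
        seen_enough = exposure.get(skill, 0) >= min_exposure
        if negligible and seen_enough:
            continue          # retire: low contribution after enough exposure
        kept.add(skill)       # retain: still valuable, or not yet judged
    if persistent_failure:
        # expand: add the first candidate that was not already active
        for candidate in candidates:
            if candidate not in active:
                kept.add(candidate)
                break
    return frozenset(kept)


def toy_evaluate(skills: FrozenSet[str]) -> float:
    """Synthetic scorer for illustration only; the gains are made-up numbers."""
    gains = {"search": 0.30, "navigate": 0.005}
    return 0.5 + sum(gains.get(s, 0.0) for s in skills)


active = frozenset({"search", "navigate"})
result = lifecycle_step(
    active,
    exposure={"search": 5, "navigate": 5},
    evaluate=toy_evaluate,
    persistent_failure=True,
    candidates=["verify"],
)
```

In this toy run, "search" is retained, "navigate" is retired because its contribution is below the threshold after five rounds, and "verify" is added because failures persist. Note that each call costs one extra validation pass per active skill, which is the main expense of a leave-one-out estimate.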
If this is right
- SLIM outperforms the best baselines by an average of 7.1 percentage points across ALFWorld and SearchQA.
- Policy learning and external skill retention are compatible: some skills become absorbed into the policy while others continue to deliver value when kept external.
- Dynamic management of the active skill set provides a more general paradigm for skill-based agentic RL than methods that assume skills either accumulate persistently or are fully internalized.
Where Pith is reading between the lines
- The same retain-retire-expand logic could be tested in other tool-using agent settings where the total number of available skills grows large.
- Periodic skill-value audits might reduce memory and compute costs in long-horizon agent training without sacrificing final performance.
- The non-monotonic skill-set pattern suggests that lifelong agent learning may benefit from explicit retirement mechanisms rather than only addition or compression.
Load-bearing premise
Because of limited parametric capacity and uneven contributions across skills, the best active skill set changes over time and depends on the specific task and training stage.
What would settle it
Running the same ALFWorld and SearchQA experiments with a fixed or monotonically accumulating skill set and finding equal or higher average performance than SLIM would show the dynamic lifecycle operations are not necessary.
Original abstract
Large language model agents increasingly rely on external skills to solve complex tasks, where skills act as modular units that extend their capabilities beyond what parametric memory alone supports. Existing methods assume external skills either accumulate as persistent guidance or are internalized into the policy, eventually leading to zero-skill inference. We argue this assumption is overly restrictive, since with limited parametric capacity and uneven marginal contribution across skills, the optimal active skill set is non-monotonic and both task- and stage-dependent. In this work, we propose SLIM, a framework of dynamic Skill LIfecycle Management for agentic reinforcement learning (RL), which treats the active external skill set as a dynamic optimization variable jointly updated with policy learning. Specifically, SLIM estimates each active skill's marginal external contribution through leave-one-skill-out validation, then applies three lifecycle operations: retaining high-value skills, retiring skills whose contribution becomes negligible after sufficient exposure, and expanding the skill bank when persistent failures reveal missing capability coverage. Experiments show that SLIM outperforms the best baselines by an average of 7.1 percentage points across ALFWorld and SearchQA. Results further indicate that policy learning and external skill retention are not mutually exclusive: some skills are absorbed into the policy, while others continue to provide external value, supporting SLIM as a more general paradigm for skill-based agentic RL.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes SLIM, a framework for dynamic Skill LIfecycle Management in agentic reinforcement learning. It models the active external skill set as a jointly optimized variable with policy learning, using leave-one-skill-out validation to estimate each skill's marginal contribution and applying retain, retire, and expand operations. The work reports that SLIM achieves a 7.1 percentage point average improvement over the best baselines on ALFWorld and SearchQA, and concludes that policy learning and external skill retention are not mutually exclusive, with some skills absorbed into the policy while others retain external value.
Significance. If the experimental claims are substantiated with full protocols and verification, the result would support a more flexible paradigm for skill-based agentic RL that avoids the restrictive assumptions of either permanent external skill accumulation or complete internalization. This could inform more adaptive agent designs under limited parametric capacity, particularly for tasks where optimal skill sets are task- and stage-dependent.
major comments (2)
- [Abstract] The reported 7.1 percentage point average improvement is presented without any description of the experimental protocol, baseline implementations, number of runs, statistical tests, or error bars, so the central performance claim cannot be evaluated or reproduced from the manuscript.
- [Abstract] The claim that skill retirement demonstrates absorption into the parametric policy (supporting the non-mutually-exclusive relationship) is not directly tested. Retirement after a negligible leave-one-skill-out contribution could instead arise from task completion, distribution shift, or alternative strategies, and no probing of the policy on isolated skill-specific sub-tasks is described to distinguish these cases.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major point below, proposing targeted revisions to enhance clarity and strengthen the evidential basis of our claims while preserving the core contributions of the work.
- Referee: [Abstract] The reported 7.1 percentage point average improvement is presented without any description of the experimental protocol, baseline implementations, number of runs, statistical tests, or error bars, so the central performance claim cannot be evaluated or reproduced from the manuscript.
  Authors: We agree that the abstract's brevity leaves the central claim difficult to evaluate in isolation. The full experimental protocol, including baseline implementations (ReAct, Reflexion, and skill-augmented variants), 5 independent runs per condition, paired t-tests for significance, and error bars, is detailed in Section 4 and Appendix B. In the revised manuscript we will expand the abstract with a concise clause summarizing the evaluation setup and number of runs, while staying within the word limit, and ensure the main text explicitly cross-references these details.
  Revision: yes
- Referee: [Abstract] The claim that skill retirement demonstrates absorption into the parametric policy (supporting the non-mutually-exclusive relationship) is not directly tested. Retirement after a negligible leave-one-skill-out contribution could instead arise from task completion, distribution shift, or alternative strategies, and no probing of the policy on isolated skill-specific sub-tasks is described to distinguish these cases.
  Authors: This observation is correct: our current support for absorption is inferential, resting on the leave-one-skill-out marginal contribution dropping to negligible levels after policy updates while overall task performance is maintained. Alternative explanations such as task completion or distribution shift cannot be ruled out without additional controls. We will therefore add a new analysis subsection that probes the updated policy on isolated skill-specific sub-tasks for a representative sample of retired skills, comparing success rates before and after retirement to provide more direct evidence of internalization versus other factors.
  Revision: yes
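The first response above mentions 5 independent runs per condition and paired t-tests. As a rough sketch of what such a test computes, the function below derives the paired t statistic from per-run scores using only the standard library. The function name `paired_t_statistic` and the example numbers are hypothetical; the numbers are synthetic placeholders, not results from the paper.

```python
import math
import statistics
from typing import Sequence


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Paired t statistic: mean of the differences divided by its standard error."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean_diff / (sd / math.sqrt(len(diffs)))


# Synthetic placeholder scores for 5 paired runs (same seed per pair).
method_scores = [11, 12, 13, 12, 12]
baseline_scores = [10, 10, 10, 10, 10]
t_stat = paired_t_statistic(method_scores, baseline_scores)
degrees_of_freedom = len(method_scores) - 1
```

Turning the statistic into a p-value requires the t distribution with `degrees_of_freedom` degrees of freedom. With SciPy available, `scipy.stats.ttest_rel` performs the whole test in one call. Pairing only makes sense if run i of each condition shares something, such as the random seed or the evaluation split.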
Circularity Check
No circularity: empirical framework and results are self-contained
The paper introduces SLIM as a joint optimization of policy and dynamic skill set via leave-one-skill-out validation for marginal contribution, followed by retain/retire/expand operations. No equations or definitions in the provided text reduce the 7.1 percentage point performance gain, the non-mutually-exclusive claim, or the absorption interpretation to a fitted parameter or self-referential input by construction. Leave-one-skill-out is presented as an independent estimator, and the lifecycle decisions are applied to observed validation outcomes without forcing the reported results or paradigm claim. The derivation chain relies on standard RL experimentation rather than any self-definitional or fitted-input reduction.
Axiom & Free-Parameter Ledger
free parameters (1)
- negligible-contribution threshold
axioms (1)
- domain assumption: With limited parametric capacity and uneven marginal contribution across skills, the optimal active skill set is non-monotonic, task- and stage-dependent.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "SLIM estimates each active skill’s marginal external contribution through leave-one-skill-out validation, then applies three lifecycle operations: retaining high-value skills, retiring skills whose contribution becomes negligible after sufficient exposure, and expanding the skill bank"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "the optimal active skill set is non-monotonic, task- and stage-dependent"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting REINFORCE style optimization for learning from human feedback in LLMs.arXiv preprint arXiv:2402.14740, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[2]
Large language models for mathematical reasoning: Progresses and challenges
Janice Ahn, Rishu Verma, Renze Lou, Di Liu, Rui Zhang, and Wenpeng Yin. Large language models for mathematical reasoning: Progresses and challenges.arXiv preprint arXiv:2402.00157, 2024
-
[3]
Physics of language models: Part 3.1, knowledge storage and extraction
Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.1, knowledge storage and extraction. InProceedings of the 41st International Conference on Machine Learning, pages 1067–1077, Vienna, Austria, 2024
work page 2024
-
[4]
Physics of language models: Part 3.2, knowledge manip- ulation
Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.2, knowledge manip- ulation. InProceedings of the 13th International Conference on Learning Representations, Singapore, Singapore, 2025
work page 2025
-
[5]
Physics of language models: Part 3.3, knowledge capacity scaling laws
Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.3, knowledge capacity scaling laws. InProceedings of the 13th International Conference on Learning Representations, Singapore, Singapore, 2025
work page 2025
-
[6]
Introducing computer use, a new claude 3.5 sonnet, and claude 3.5 haiku
Anthropic. Introducing computer use, a new claude 3.5 sonnet, and claude 3.5 haiku. https: //www.anthropic.com/news/3-5-models-and-computer-use, 2024
work page 2024
-
[7]
Anthropic. Agent skills. https://docs.claude.com/en/docs/agents-and-tools/ agent-skills, 2025
work page 2025
-
[8]
SkVM: Revisiting Language VM for Skills across Heterogenous LLMs and Harnesses
Le Chen, Erhu Feng, Yubin Xia, and Haibo Chen. Skvm: Revisiting language vm for skills across heterogenous llms and harnesses.arXiv preprint arXiv:2604.03088, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[9]
InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy
Xinyi Chen, Yilun Chen, Yanwei Fu, Ning Gao, Jiaya Jia, Weiyang Jin, Hao Li, Yao Mu, Jiangmiao Pang, Yu Qiao, Yang Tian, Bin Wang, Bolun Wang, Fangjing Wang, Hanqing Wang, Tai Wang, Ziqin Wang, Xueyuan Wei, Chao Wu, Shuai Yang, Jinhui Ye, Junqiu Yu, Jia Zeng, Jingjing Zhang, Jinyu Zhang, Shi Zhang, Feng Zheng, Bowen Zhou, and Yangkun Zhu. Internvla-m1: A ...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[10]
Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory
Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready ai agents with scalable long-term memory.arXiv preprint arXiv:2504.19413, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[11]
Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping H...
work page 2024
-
[12]
DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D...
work page 2025
-
[13]
Yao Fu, Dong-Ki Kim, Jaekyeom Kim, Sungryull Sohn, Lajanugen Logeswaran, Kyunghoon Bae, and Honglak Lee. Autoguide: Automated generation and selection of context-aware guidelines for large language model agents.arXiv preprint arXiv:2403.08978, 2024
-
[14]
Off-policy deep reinforcement learning without exploration
Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. InProceedings of the 36th International Conference on Machine Learning, pages 2052–2062, Long Beach, CA, 2019
work page 2052
-
[15]
Anna Goldie, Azalia Mirhoseini, Hao Zhou, Irene Cai, and Christopher D Manning. Synthetic data generation & multi-step rl for reasoning & tool use.arXiv preprint arXiv:2504.04736, 2025
-
[16]
Gemini Deep Research — your personal research assistant
Google. Gemini Deep Research — your personal research assistant. https://gemini. google/overview/deep-research/, 2025
work page 2025
-
[17]
SRSA: skill retrieval and adaptation for robotic assembly tasks
Yijie Guo, Bingjie Tang, Iretiayo Akinola, Dieter Fox, Abhishek Gupta, and Yashraj Narang. SRSA: skill retrieval and adaptation for robotic assembly tasks. InProceedings of the 13th International Conference on Learning Representations, Singapore, Singapore, 2025. OpenRe- view.net
work page 2025
-
[18]
Toolkengpt: Augmenting frozen language models with massive tools via tool embeddings
Shibo Hao, Tianyang Liu, Zhen Wang, and Zhiting Hu. Toolkengpt: Augmenting frozen language models with massive tools via tool embeddings. InAdvances in Neural Information Processing Systems 36, New Orleans, LA, 2023
work page 2023
-
[19]
Webvoyager: Building an end-to-end web agent with large multimodal models
Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pages 6864–6890, Bangkok, Thailand, 2024
work page 2024
-
[20]
Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps
Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. InProceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, Barcelona, Spain, 2020
work page 2020
-
[21]
Wassily Hoeffding. Probability inequalities for sums of bounded random variables.Journal of the American Statistical Association, 58(301):13–30, 1963
work page 1963
-
[22]
Understanding the planning of LLM agents: A survey
Xu Huang, Weiwen Liu, Xiaolong Chen, Xingmei Wang, Hao Wang, Defu Lian, Yasheng Wang, Ruiming Tang, and Enhong Chen. Understanding the planning of llm agents: A survey.arXiv preprint arXiv:2402.02716, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[23]
Search-r1: Training LLMs to reason and leverage search engines with reinforcement learning
Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan O Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-r1: Training LLMs to reason and leverage search engines with reinforcement learning. InProceedings of the 2nd Conference on Language Modeling, Montreal, Canada, 2025. 11
work page 2025
-
[24]
Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1601–1611, Vancouver, Canada, 2017
work page 2017
-
[25]
Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research.Transact...
work page 2019
-
[26]
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks.Advances in Neural Information Processing Systems 33, pages 9459–9474, 2020
work page 2020
-
[27]
Search-o1: Agentic Search-Enhanced Large Reasoning Models
Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. Search-o1: Agentic search-enhanced large reasoning models.arXiv preprint arXiv:2501.05366, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[28]
Chatsop: An sop-guided mcts planning framework for controllable llm dialogue agents
Zhigen Li, Jianxiang Peng, Yanmeng Wang, Yong Cao, Tianhao Shen, Minghui Zhang, Linxi Su, Shang Wu, Yihang Wu, YuQian Wang, Ye Wang, Wei Hu, Jianfeng Li, Shaojun Wang, Jing Xiao, and Deyi Xiong. Chatsop: An sop-guided mcts planning framework for controllable llm dialogue agents. InProceedings of the 63rd Annual Meeting of the Association for Computational...
work page 2025
-
[29]
Minhua Lin, Zongyu Wu, Zhichao Xu, Hui Liu, Xianfeng Tang, Qi He, Charu Aggarwal, Hui Liu, Xiang Zhang, and Suhang Wang. A comprehensive survey on reinforcement learning-based agentic search: Foundations, roles, optimizations, evaluations, and applications.arXiv preprint arXiv:2510.16724, 2025
-
[30]
Skillact: Using skill abstractions improves llm agents
Anthony Zhe Liu, Jongwook Choi, Sungryull Sohn, Yao Fu, Jaekyeom Kim, Dong-Ki Kim, Xinhe Wang, Jaewon Yu, and Honglak Lee. Skillact: Using skill abstractions improves llm agents. InICML 2024 Workshop on LLMs and Cognition, Vienna, Austria, 2024
work page 2024
-
[31]
SimpleMem: Efficient Lifelong Memory for LLM Agents
Jiaqi Liu, Yaofeng Su, Peng Xia, Siwei Han, Zeyu Zheng, Cihang Xie, Mingyu Ding, and Huaxiu Yao. Simplemem: Efficient lifelong memory for LLM agents.arXiv preprint arXiv:2601.02553, 2026
work page internal anchor Pith review arXiv 2026
-
[32]
Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang
Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.Transactions of the Association for Computational Linguistics, 12:157–173, 2024
work page 2024
-
[33]
SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization
Zhengxi Lu, Zhiyuan Yao, Jinyang Wu, Chengcheng Han, Qi Gu, Xunliang Cai, Weiming Lu, Jun Xiao, Yueting Zhuang, and Yongliang Shen. Skill0: In-context agentic reinforcement learning for skill internalization.arXiv preprint arXiv:2604.02268, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[34]
Large Language Model Agent: A Survey on Methodology, Applications and Challenges
Junyu Luo, Weizhi Zhang, Ye Yuan, Yusheng Zhao, Junwei Yang, Yiyang Gu, Bohan Wu, Binqi Chen, Ziyue Qiao, Qingqing Long, Rongcheng Tu, Xiao Luo, Wei Ju, Zhiping Xiao, Yifan Wang, Meng Xiao, Chenwu Liu, Jingyang Yuan, Shichang Zhang, Yiqiao Jin, Fan Zhang, Xian Wu, Hanqing Zhao, Dacheng Tao, Philip S. Yu, and Ming Zhang. Large language model agent: A surve...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[35]
Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Ha- jishirzi. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics, pages 9802–9822, Toronto, Canada, 2023
work page 2023
-
[36]
Skill-Pro: Learning Reusable Skills from Experience via Non-Parametric PPO for LLM Agents
Qirui Mi, Zhijian Ma, Mengyue Yang, Haoxuan Li, Yisen Wang, Haifeng Zhang, and Jun Wang. Skill-pro: Learning reusable skills from experience via non-parametric ppo for llm agents.arXiv preprint arXiv:2602.01869, 2026. 12
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[37]
Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills
Jingwei Ni, Yihao Liu, Xinpeng Liu, Yutao Sun, Mengyu Zhou, Pengyu Cheng, Dexin Wang, Erchao Zhao, Xiaoxi Jiang, and Guanjun Jiang. Trace2skill: Distill trajectory-local lessons into transferable agent skills.arXiv preprint arXiv:2603.25158, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[38]
OpenAI. Computer-using agent. https://openai.com/index/computer-using-agent/, 2025
work page 2025
-
[39]
OpenAI. Deep research system card. https://openai.com/index/ deep-research-system-card, 2025
work page 2025
-
[40]
Patil, Tianjun Zhang, Xin Wang, and Joseph E
Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large language model connected with massive apis. InAdvances in Neural Information Processing Systems 37, Vancouver, Canada, 2024
work page 2024
-
[41]
Aske Plaat, Annie Wong, Suzan Verberne, Joost Broekens, Niki van Stein, and Thomas Bäck. Multi-step reasoning with large language models, a survey.ACM Computing Surveys, 58(6): 160:1–160:35, 2026
work page 2026
-
[42]
Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 5687–5711, Singapore, 2023
work page 2023
-
[43]
Toolllm: Facilitating large language models to master 16000+ real-world apis
Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. Toolllm: Facilitating large language models to master 16000+ real-world apis. InProceedings of the 12th International Conference...
work page 2024
-
[44]
Toolformer: Language models can teach themselves to use tools
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. InAdvances in Neural Information Processing Systems 36, New Orleans, LA, 2023
work page 2023
-
[45]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[46]
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y . K. Li, Y . Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[47]
Semi-off-policy reinforcement learning for vision- language slow-thinking reasoning
Junhao Shen, Haiteng Zhao, Yuzhe Gu, Songyang Gao, Kuikun Liu, Haian Huang, Jianfei Gao, Dahua Lin, Wenwei Zhang, and Kai Chen. Semi-off-policy reinforcement learning for vision- language slow-thinking reasoning. InAdvances in Neural Information Processing Systems 38, San Diego, CA, 2025
work page 2025
-
[48]
Zhuocheng Shen. Llm with tools: A survey.arXiv preprint arXiv:2409.18807, 2024
-
[49]
Reflexion: Language agents with verbal reinforcement learning
Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. InAdvances in Neural Information Processing Systems 36, New Orleans, LA, USA, 2023
work page 2023
-
[50]
Alfworld: Aligning text and embodied environments for interactive learning
Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning. InProceedings of the 9th International Conference on Learning Representations, Virtual Conference, 2021
work page 2021
-
[51]
Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition.Transactions of the Association for Computational Linguistics, 10:539–554, 2022
work page 2022
-
[52]
Voyager: An Open-Ended Embodied Agent with Large Language Models
Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. V oyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023. 13
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[53]
Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Ji-Rong Wen. A survey on large language model based autonomous agents.Frontiers of Computer Science, 18(6): 186345, 2024
work page 2024
-
[54]
WebXSkill: Skill Learning for Autonomous Web Agents
Zhaoyang Wang, Qianhui Wu, Xuchao Zhang, Chaoyun Zhang, Wenlin Yao, Fazle Elahi Faisal, Baolin Peng, Si Qin, Suman Nath, Qingwei Lin, Chetan Bansal, Dongmei Zhang, Saravan Rajmohan, Jianfeng Gao, and Huaxiu Yao. Webxskill: Skill learning for autonomous web agents.arXiv preprint arXiv:2604.13318, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[55]
Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory. arXiv preprint arXiv:2409.07429, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[56]
Reinforcing multi-turn reasoning in llm agents via turn-level credit assignment
Quan Wei, Siliang Zeng, Chenliang Li, William Brown, Oana Frunza, Wei Deng, Anderson Schneider, Yuriy Nevmyvaka, Yang Katie Zhao, Alfredo Garcia, and Mingyi Hong. Reinforcing multi-turn reasoning in llm agents via turn-level reward design.arXiv preprint arXiv:2505.11821, 2025
-
[57]
EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle
Rong Wu, Xiaoman Wang, Jianbiao Mei, Pinlong Cai, Daocheng Fu, Cheng Yang, Licheng Wen, Xuemeng Yang, Yufan Shen, Yuxin Wang, and Botian Shi. Evolver: Self-evolving llm agents through an experience-driven lifecycle.arXiv preprint arXiv:2510.16079, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[58]
Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks
Xiyang Wu, Zongxia Li, Guangyao Shi, Alexander Duffy, Tyler Marques, Matthew Lyle Olson, Tianyi Zhou, and Dinesh Manocha. Co-evolving llm decision and skill bank agents for long-horizon tasks.arXiv preprint arXiv:2604.20987, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[59]
SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning
Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, Zeyu Zheng, Cihang Xie, and Huaxiu Yao. Skillrl: Evolving agents via recursive skill-augmented reinforcement learning.arXiv preprint arXiv:2602.08234, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[60]
Liu, Yiheng Xu, Hongjin Su, Dongchan Shin, Caiming Xiong, and Tao Yu
Tianbao Xie, Fan Zhou, Zhoujun Cheng, Peng Shi, Luoxuan Weng, Yitao Liu, Toh Jing Hua, Junning Zhao, Qian Liu, Che Liu, Leo Z. Liu, Yiheng Xu, Hongjin Su, Dongchan Shin, Caiming Xiong, and Tao Yu. Openagents: An open platform for language agents in the wild. In Proceedings of the 1st Conference on Language Modeling, Philadelphia, PA, 2024
work page 2024
-
[61]
Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models
Fengli Xu, Qianyue Hao, Zefang Zong, Jingwei Wang, Yunke Zhang, Jingyi Wang, Xiaochong Lan, Jiahui Gong, Tianjian Ouyang, Fanjin Meng, Chenyang Shao, Yuwei Yan, Qinglong Yang, Yiwen Song, Sijian Ren, Xinyuan Hu, Yu Li, Jie Feng, Chen Gao, and Yong Li. Towards large reasoning models: A survey of reinforced reasoning with large language models.arXiv preprin...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[62]
A-MEM: Agentic Memory for LLM Agents
Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for llm agents.arXiv preprint arXiv:2502.12110, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[63]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[64]
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium, 2018
work page 2018
-
[65]
ReAct: Synergizing reasoning and acting in language models
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. InProceedings of the 11th International Conference on Learning Representations (ICLR), Kigali, Rwanda, 2023. 14
-
[66]
Sop-agent: Empower general purpose ai agent with domain-specific sops
Anbang Ye, Qianran Ma, Jia Chen, Muqi Li, Tong Li, Fujiao Liu, Siqi Mai, Meichen Lu, Haitao Bao, and Yang You. Sop-agent: Empower general purpose ai agent with domain-specific sops. arXiv preprint arXiv:2501.09316, 2025
-
[67]
Tik Yu Yim, Wenting Tan, Sum Yee Chan, Tak-Wah Lam, and Siu Ming Yiu. Asda: Automated skill distillation and adaptation for financial reasoning. arXiv preprint arXiv:2603.16112, 2026
-
[68]
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...
-
[69]
MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory
Shengtao Zhang, Jiaqian Wang, Ruiwen Zhou, Junwei Liao, Yuchen Feng, Zhuo Li, Yujie Zheng, Weinan Zhang, Ying Wen, Zhiyu Li, Feiyu Xiong, Yutao Qi, Bo Tang, and Muning Wen. Memrl: Self-evolving agents via runtime reinforcement learning on episodic memory. arXiv preprint arXiv:2601.03192, 2026
-
[70]
Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents
Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, and Peiyang He. Experience compression spectrum: Unifying memory, skills, and rules in llm agents. arXiv preprint arXiv:2604.15877, 2026
-
[71]
Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models
Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. Qwen3 embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176, 2025
-
[72]
Expel: Llm agents are experiential learners
Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. Expel: Llm agents are experiential learners. In Proceedings of the 38th AAAI Conference on Artificial Intelligence, pages 19632–19642, Vancouver, Canada, 2024
-
[73]
Haiteng Zhao, Junhao Shen, Yiming Zhang, Songyang Gao, Kuikun Liu, Tianyou Ma, Fan Zheng, Dahua Lin, Wenwei Zhang, and Kai Chen. Achieving olympia-level geometry large language model agent via complexity boosting reinforcement learning. In Proceedings of the 14th International Conference on Learning Representations, Rio de Janeiro, Brazil, 2026
-
[74]
Qingfei Zhao, Ruobing Wang, Dingling Xu, Daren Zha, and Limin Liu. R-search: Empowering llm reasoning with search via multi-reward reinforcement learning. arXiv preprint arXiv:2506.04185, 2025
-
[75]
SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills
Boyuan Zheng, Michael Y. Fatemi, Xiaolong Jin, Zora Zhiruo Wang, Apurva Gandhi, Yueqi Song, Yu Gu, Jayanth Srinivasa, Gaowen Liu, Graham Neubig, and Yu Su. Skillweaver: Web agents can self-improve by discovering and honing skills. arXiv preprint arXiv:2504.07079, 2025
-
[76]
Skillrouter: Retrieve-and-rerank skill selection for llm agents at scale
YanZhao Zheng, ZhenTao Zhang, Chao Ma, YuanQiang Yu, JiHuai Zhu, Yong Wu, Tianze Xu, Baohua Dong, Hangcheng Zhu, Ruohui Huang, and Gang Yu. Skillrouter: Skill routing for llm agents at scale. arXiv preprint arXiv:2603.22455, 2026
-
[77]
Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering
Chenyu Zhou, Huacan Chai, Wenteng Chen, Zihan Guo, Rong Shan, Yuanyi Song, Tianyi Xu, Yingxuan Yang, Aofan Yu, Weiming Zhang, Congming Zheng, Jiachen Zhu, Zeyu Zheng, Zhuosheng Zhang, Xingyu Lou, Changwang Zhang, Zhihui Fu, Jun Wang, Weiwen Liu, Jianghao Lin, and Weinan Zhang. Externalization in llm agents: A unified review of memory, skills, protocols an...
-
[78]
### TASK: {task_type} ### For {readable_task} tasks, apply these specific strategies:
{title}: {principle} Apply this when {when_to_apply} ...
-
[79]
Figure A5: Skill insertion format used by SLIM and skill-conditioned baselines. General skills are inserted as a group when active, while task-specific skills are retrieved by task type and semantic similarity.
Title: Zero-Shot and Few-Shot Prompting Zero-shot: {obs_text} Few-shot prefix: Below are ...
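As a rough illustration of the Figure A5 insertion format, the sketch below fills the template placeholders ({task_type}, {readable_task}, {title}, {principle}, {when_to_apply}) for the skills that match a task type. The `Skill` class, the function names, the line breaks between template parts, and the example skills are all assumptions made for illustration and are not taken from the paper. Ranking by semantic similarity is left out; only the task-type filter is shown.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Skill:
    """Hypothetical container for the fields named in the Figure A5 template."""
    title: str
    principle: str
    when_to_apply: str
    task_type: Optional[str] = None  # assumption: None marks a general skill


def render_skill(skill: Skill) -> str:
    # One line per skill, following "{title}: {principle} Apply this when {when_to_apply}".
    return f"{skill.title}: {skill.principle} Apply this when {skill.when_to_apply}"


def render_task_block(task_type: str, readable_task: str, skills: List[Skill]) -> str:
    # Header wording is copied from the template; the newline placement is a guess.
    header = (
        f"### TASK: {task_type} ###\n"
        f"For {readable_task} tasks, apply these specific strategies:"
    )
    # Keep only task-specific skills for this task type (similarity ranking omitted).
    lines = [render_skill(s) for s in skills if s.task_type == task_type]
    return "\n".join([header, *lines])


if __name__ == "__main__":
    # Made-up example skills, not drawn from the paper's skill bank.
    bank = [
        Skill("Check receptacles", "Look inside closed containers first.",
              "an object is not visible", "pick_and_place"),
        Skill("Preheat check", "Confirm the appliance is reachable.",
              "the goal mentions heating", "heat"),
    ]
    print(render_task_block("pick_and_place", "pick and place", bank))
```

A real pipeline would presumably rank the filtered skills by embedding similarity to the current observation before rendering; that step depends on details not visible in this excerpt.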
-
[80]
Routing-oriented description