Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
Pith reviewed 2026-05-13 07:13 UTC · model grok-4.3
The pith
A single policy can co-evolve skill selection, utilization, and distillation from one task-outcome signal by separating its low-frequency trend and high-frequency variation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Skill1 is a framework in which one policy generates a query to search the skill library, re-ranks candidates to select one, solves the task conditioned on the chosen skill, and distills a new skill from the trajectory, with every update driven by a single task-outcome signal whose low-frequency trend supplies credit for selection and whose high-frequency variation supplies credit for distillation.
What carries the argument
The single RL policy that integrates query generation for skill retrieval, candidate re-ranking for selection, conditioned task execution, and trajectory-based distillation, with credit assignment performed by frequency decomposition of the shared outcome reward.
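The loop this describes is compact enough to sketch. Below is a minimal, hypothetical rendering in Python; every helper name (`generate_query`, `retrieve`, `rerank`, `execute`, `distill_skill`, `add`) is an assumed stand-in for a stage the paper describes but whose interface is not specified here.

```python
# Minimal sketch of the Skill1 episode loop as described in the core claim.
# All helper names are hypothetical stand-ins, not the paper's actual API.

def skill1_episode(policy, skill_library, task):
    # 1. Selection: the policy writes a search query over the library,
    #    then re-ranks the retrieved candidates and commits to one.
    query = policy.generate_query(task)
    candidates = skill_library.retrieve(query)
    skill = policy.rerank(task, candidates)

    # 2. Utilization: solve the task conditioned on the chosen skill.
    trajectory, outcome = policy.execute(task, skill)

    # 3. Distillation: write a new skill from the trajectory and add it
    #    to the library, so selection and creation stay coupled.
    skill_library.add(policy.distill_skill(trajectory))

    # A single scalar task-outcome signal drives every update.
    return trajectory, outcome
```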
If this is right
- The three capabilities of selection, utilization, and distillation improve simultaneously during training under the shared objective.
- Skill1 outperforms prior skill-based methods and standard reinforcement-learning baselines on the ALFWorld and WebShop benchmarks.
- Removing the low-frequency credit signal or the high-frequency credit signal each degrades the co-evolution of the three capabilities.
- All learning, including skill library growth, derives from the single task-outcome signal without auxiliary rewards.
Where Pith is reading between the lines
- Skill libraries could expand more coherently across open-ended task sequences because selection and creation remain coupled through the same policy.
- The frequency-separation idea for credit assignment might transfer to other agent settings that require learning at multiple timescales.
- Less hand-designed reward engineering may be needed for long-term skill management if one outcome signal suffices for all three functions.
Load-bearing premise
The low-frequency trend and high-frequency variation of one task-outcome signal can be cleanly separated to supply non-conflicting credits for skill selection versus distillation.
What would settle it
An experiment showing that the low- and high-frequency components of the outcome signal overlap strongly, so that ablating either component produces no performance drop relative to training selection and distillation with separate rewards; such a result would falsify the clean-separation premise. A sketch of how that overlap could be measured follows.
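One way to run that check: decompose the per-episode outcome series and measure how strongly the two components correlate. The EMA split below is an assumed stand-in for whatever filter the paper actually uses.

```python
import numpy as np

def decompose(outcomes: np.ndarray, alpha: float = 0.9):
    """Split an outcome series into an EMA trend (low frequency) and
    its residual (high frequency). alpha is an assumed smoothing factor."""
    trend = np.empty_like(outcomes, dtype=float)
    ema = float(outcomes[0])
    for t, r in enumerate(outcomes):
        ema = alpha * ema + (1 - alpha) * r
        trend[t] = ema
    return trend, outcomes - trend

def component_overlap(outcomes) -> float:
    """Pearson correlation between the two components. Values near 0 are
    consistent with separable credits; values near 1 suggest the
    decomposition cannot supply non-conflicting credit."""
    low, high = decompose(np.asarray(outcomes, dtype=float))
    return float(np.corrcoef(low, high)[0, 1])
```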
Original abstract
A persistent skill library allows language model agents to reuse successful strategies across tasks. Maintaining such a library requires three coupled capabilities. The agent selects a relevant skill, utilizes it during execution, and distills new skills from experience. Existing methods optimize these capabilities in isolation or with separate reward sources, resulting in partial and conflicting evolution. We propose Skill1, a framework that trains a single policy to co-evolve skill selection, utilization, and distillation toward a shared task-outcome objective. The policy generates a query to search the skill library, re-ranks candidates to select one, solves the task conditioned on it, and distills a new skill from the trajectory. All learning derives from a single task-outcome signal. Its low-frequency trend credits selection and its high-frequency variation credits distillation. Experiments on ALFWorld and WebShop show that Skill1 outperforms prior skill-based and reinforcement learning baselines. Training dynamics confirm the co-evolution of the three capabilities, and ablations show that removing any credit signal degrades the evolution.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Skill1, a unified RL framework in which a single policy co-evolves three capabilities—skill selection (via query generation and re-ranking), utilization, and distillation—by deriving all credit from one scalar task-outcome signal. Low-frequency trends of this signal are used to credit selection while high-frequency variations credit distillation; the policy is trained end-to-end on ALFWorld and WebShop, outperforming prior skill-based and RL baselines, with ablations confirming degradation when either frequency component is removed.
Significance. If the frequency-based credit separation can be shown to remain non-conflicting under the joint optimization and sparse-reward conditions of the target domains, the work would offer a parameter-free mechanism for maintaining coherent skill libraries without separate reward engineering. The reported outperformance and co-evolution dynamics would then constitute a concrete advance over methods that optimize the three capabilities in isolation.
major comments (3)
- [Abstract and §3 (Method)] The claim that low-frequency trends cleanly credit selection while high-frequency variations credit distillation is load-bearing for the central contribution, yet no concrete filter (moving average, spectral cutoff, etc.), stationarity assumptions, or gradient-flow analysis is provided. In sparse-reward settings such as ALFWorld, any practical decomposition risks mixing selection and distillation gradients once utilization updates alter the trajectory distribution.
- [§4 (Experiments)] The ablation results that remove credit signals are reported without error bars, statistical significance tests, or exact implementation details of the frequency extraction. This prevents verification that the observed degradation is attributable to loss of the claimed credit separation rather than implementation artifacts.
- [§3.2 (Policy architecture)] The joint optimization over query-generation, re-ranking, utilization, and distillation heads creates an entanglement risk that is not analyzed; updates to the utilization head necessarily change the distribution of trajectories whose outcome signal is then decomposed for the other heads.
minor comments (2)
- [§3] Notation for the frequency decomposition (e.g., symbols for low- and high-pass components) should be introduced once and used consistently throughout the method and analysis sections.
- [§3] The manuscript would benefit from a short pseudocode listing the exact sequence of query generation, skill retrieval, execution, outcome extraction, and frequency-based credit assignment.
Simulated Author's Rebuttal
We thank the referee for the detailed and insightful comments on our manuscript. We address each major comment below and have made revisions to strengthen the paper accordingly.
Point-by-point responses
- Referee: [Abstract and §3 (Method)] The claim that low-frequency trends cleanly credit selection while high-frequency variations credit distillation is load-bearing for the central contribution, yet no concrete filter (moving average, spectral cutoff, etc.), stationarity assumptions, or gradient-flow analysis is provided. In sparse-reward settings such as ALFWorld, any practical decomposition risks mixing selection and distillation gradients once utilization updates alter the trajectory distribution.
  Authors: We agree that providing concrete implementation details is essential for reproducibility and to substantiate the central claim. In the revised manuscript, we specify the frequency decomposition method as an exponential moving average with a smoothing factor of 0.9 for the low-frequency trend, with the high-frequency component derived as the residual. We include a brief analysis of the gradient flow, demonstrating that selection gradients are computed at the episode level using the low-frequency signal, while distillation uses per-step high-frequency variations, minimizing interference in sparse-reward environments. Stationarity is assumed over short task horizons, which holds in our experimental setups. We have added this to §3. (Revision: yes.)
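For concreteness, a minimal sketch of the decomposition and credit routing as stated in this response, assuming the smoothing factor of 0.9 and the residual split; the loss form and exact gradient paths are illustrative assumptions, not the paper's verified implementation.

```python
import torch

def split_credit(rewards: torch.Tensor, alpha: float = 0.9):
    """EMA trend (low frequency) and residual (high frequency) of a
    per-step reward sequence, per the smoothing factor stated above."""
    trend = torch.empty_like(rewards)
    ema = rewards[0]
    for t in range(len(rewards)):
        ema = alpha * ema + (1 - alpha) * rewards[t]
        trend[t] = ema
    return trend, rewards - trend

def joint_loss(logp_selection, logp_steps, rewards):
    # Episode-level low-frequency credit drives the selection log-prob;
    # per-step high-frequency residuals drive distillation log-probs.
    # The weighting is an assumption consistent with the response.
    low, high = split_credit(rewards)
    selection_loss = -low.mean().detach() * logp_selection
    distill_loss = -(high.detach() * logp_steps).sum()
    return selection_loss + distill_loss
```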
- Referee: [§4 (Experiments)] The ablation results that remove credit signals are reported without error bars, statistical significance tests, or exact implementation details of the frequency extraction. This prevents verification that the observed degradation is attributable to loss of the claimed credit separation rather than implementation artifacts.
  Authors: We acknowledge the need for rigorous statistical reporting. In the revision, we have added error bars representing standard deviation over 5 random seeds for all ablation results. We performed paired t-tests to confirm statistical significance of the performance drops (p < 0.05). Additionally, we provide the exact hyperparameters for the frequency extraction in the appendix, including the moving average parameters and how residuals are computed. (Revision: yes.)
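For reference, a minimal example of the reported test, using placeholder per-seed success rates; the numbers below are illustrative, not the paper's results.

```python
from scipy import stats

# Placeholder success rates over 5 seeds; illustrative values only.
full_method = [0.82, 0.79, 0.84, 0.80, 0.83]
ablated     = [0.71, 0.69, 0.74, 0.70, 0.72]

# Paired t-test across matched seeds, as described in the response.
t_stat, p_value = stats.ttest_rel(full_method, ablated)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # drop is significant if p < 0.05
```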
- Referee: [§3.2 (Policy architecture)] The joint optimization over query-generation, re-ranking, utilization, and distillation heads creates an entanglement risk that is not analyzed; updates to the utilization head necessarily change the distribution of trajectories whose outcome signal is then decomposed for the other heads.
  Authors: This is a valid concern regarding potential distribution shift during joint training. We have added a new subsection in §3.2 analyzing this entanglement risk. We show that by alternating updates or using a replay buffer for trajectory sampling, the distribution changes are mitigated. Additional experiments in the revision demonstrate that the co-evolution remains coherent, with skill selection accuracy improving steadily despite utilization updates. (Revision: yes.)
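A hedged sketch of the mitigation described: round-robin head updates over trajectories sampled from a bounded replay buffer, so the outcome distribution being decomposed shifts gradually. All names here are hypothetical.

```python
import random
from collections import deque

replay = deque(maxlen=10_000)  # bounded trajectory replay buffer
HEADS = ("selection", "utilization", "distillation")

def training_step(step, policy, env):
    # Collect fresh experience, then update only one head per step so
    # the other heads see a slowly drifting trajectory distribution.
    replay.append(policy.rollout(env))
    batch = random.sample(replay, k=min(32, len(replay)))
    head = HEADS[step % len(HEADS)]
    policy.update(head, batch)  # alternating updates; other heads held fixed
```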
Circularity Check
No significant circularity in Skill1 derivation chain
Full rationale
The paper's central mechanism extracts credit signals for skill selection and distillation by frequency decomposition of an external task-outcome reward. This is a direct methodological assignment applied to an observed scalar signal rather than a quantity defined in terms of itself or a fitted parameter relabeled as a prediction. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the provided derivation. The co-evolution claim rests on the joint policy optimization under the shared signal, which remains falsifiable against external benchmarks such as ALFWorld and WebShop performance.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: A single policy can simultaneously optimize skill selection, utilization, and distillation when credit is assigned via low-frequency trends for selection and high-frequency variation for distillation from one task-outcome signal.