arxiv: 2512.17102 · v2 · pith:DGUU7JQNnew · submitted 2025-12-18 · 💻 cs.AI

Reinforcement Learning for Self-Improving Agent with Skill Library

Jiongxiao Wang, Qiaojing Yan, Yawei Wang, Yijun Tian, Soumya Smruti Mishra, Zhichao Xu, Megha Gandhi, Panpan Xu, Lin Lee Cheong

Pith reviewed 2026-05-17 20:01 UTC · model grok-4.3

classification 💻 cs.AI

keywords reinforcement learning · skill library · LLM agents · self-improvement · sequential rollout · GRPO · AppWorld benchmark

The pith

A reinforcement learning method lets LLM agents accumulate skills across task chains to improve accuracy and efficiency without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SAGE, a framework that embeds skill libraries into reinforcement learning for LLM agents. Agents execute sequential rollouts on chains of similar tasks, storing skills from early tasks so they become available later, and a skill-integrated reward guides the process alongside standard outcome rewards. On the AppWorld benchmark, applying SAGE to a supervised-finetuned model raises scenario goal completion by 8.9 percent while cutting interaction steps by 26 percent and token generation by 59 percent. A sympathetic reader would care because this offers a path for agents to keep adapting in new environments through their own experience rather than relying solely on initial prompting or separate training runs.

Core claim

SAGE augments GRPO by incorporating skills via sequential rollouts, where agents move through chains of related tasks and skills generated earlier accumulate in the library for reuse on later tasks. The framework adds a skill-integrated reward that complements outcome-based rewards. When applied to a supervised-finetuned model with expert experience, this produces 8.9 percent higher scenario goal completion, 26 percent fewer interaction steps, and 59 percent fewer tokens than existing approaches.
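As background for the GRPO part of the claim, here is a minimal sketch of group-relative reward normalization, the step that gives GRPO its name. The function name, the `eps` constant, and the use of the population standard deviation are assumptions for illustration; the sample rewards are made-up placeholders, and how SAGE forms the per-rollout reward for a task chain is not specified here.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each rollout's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Placeholder rewards for four rollouts of the same task chain (not paper data).
advantages = group_relative_advantages([2.0, 1.0, 1.0, 0.0])
flat_group = group_relative_advantages([1.0, 1.0, 1.0])
```

A group in which every rollout earns the same reward yields zero advantages, so only differences within the group produce a learning signal.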

What carries the argument

Sequential Rollout, which chains similar tasks so that skills generated on earlier ones accumulate in the library and become available for subsequent tasks in the same rollout.
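The control flow can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the paper's implementation: `SkillLibrary`, `sequential_rollout`, `toy_agent`, and the task strings are hypothetical names, and the library is assumed to start empty for each rollout.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Skill = Tuple[str, str]  # (name, source code); a hypothetical representation


@dataclass
class SkillLibrary:
    """Hypothetical in-memory store mapping skill names to source code."""
    skills: Dict[str, str] = field(default_factory=dict)

    def add(self, skill: Skill) -> None:
        name, code = skill
        self.skills[name] = code

    def names(self) -> List[str]:
        return sorted(self.skills)


# An agent call receives the task and the skill names visible so far, and
# returns (outcome_reward, newly_written_skill_or_None).
AgentFn = Callable[[str, List[str]], Tuple[float, Optional[Skill]]]


def sequential_rollout(task_chain: List[str], run_task: AgentFn):
    """Run one rollout over a chain of similar tasks sharing one library."""
    library = SkillLibrary()  # assumption: the library starts empty per rollout
    trace = []
    for task in task_chain:
        visible = library.names()
        reward, new_skill = run_task(task, visible)
        trace.append((task, visible, reward))
        if new_skill is not None:
            library.add(new_skill)  # becomes visible to later tasks only
    return trace, library


def toy_agent(task: str, visible: List[str]) -> Tuple[float, Optional[Skill]]:
    """Stand-in for an LLM agent: writes a skill once, then reuses it."""
    if "play_song" in visible:
        return 1.0, None
    return 1.0, ("play_song", "def play_song(title): ...")


trace, library = sequential_rollout(["play song A", "play song B"], toy_agent)
```

In this toy run the first task sees an empty library and the second task sees the `play_song` skill written during the first, which is the accumulation effect the section describes.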

If this is right

Agents achieve higher rates of scenario goal completion through accumulated skills.
Fewer interaction steps are needed to finish tasks.
Token usage drops substantially during agent operation.
Supervised-finetuned models gain further improvements when the RL skill mechanism is added.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Automatic detection of task similarity could let this approach extend beyond manually chosen chains.
Periodic skill validation steps might be required if chains grow long enough for errors to compound.
The method could combine with external memory systems to handle larger libraries without efficiency loss.

Load-bearing premise

That skills generated and stored during sequential rollouts remain accurate and relevant when reused on later tasks without introducing compounding errors or requiring expensive validation.

What would settle it

Measuring goal completion on the later tasks in a chain when the skill library is disabled versus when it is enabled and checking whether the gap disappears or reverses.
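A sketch of that comparison, assuming per-task success flags are available for each chain under both conditions. The function name, the `first_later_index` parameter, and the flag values are hypothetical placeholders, not results from the paper.

```python
from statistics import mean
from typing import List


def later_task_gap(
    enabled: List[List[int]],
    disabled: List[List[int]],
    first_later_index: int = 1,
) -> float:
    """Mean success on later chain positions: library enabled minus disabled."""

    def later_mean(chains: List[List[int]]) -> float:
        # Keep only tasks that could have reused a skill from an earlier task.
        flags = [flag for chain in chains for flag in chain[first_later_index:]]
        return mean(flags)  # raises StatisticsError if no later tasks exist

    return later_mean(enabled) - later_mean(disabled)


# Made-up placeholder flags (1 = goal completed), two chains of three tasks each.
with_library = [[1, 1, 1], [0, 1, 1]]
without_library = [[1, 0, 1], [0, 1, 0]]
gap = later_task_gap(with_library, without_library)
```

A gap near zero or below zero on real data would suggest the library is not what drives the later-task gains.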

Figures

Figures reproduced from arXiv: 2512.17102 by Jiongxiao Wang, Lin Lee Cheong, Megha Gandhi, Panpan Xu, Qiaojing Yan, Soumya Smruti Mishra, Yawei Wang, Yijun Tian, Zhichao Xu.

**Figure 2.** Analysis of Skill Usage Patterns. Performance…

**Figure 3.** The baseline agent directly generates codes to…

**Figure 4.** Skill Library Agent will first define a function…

**Figure 5.** Training curve of SAGE

**Figure 6.** SGC and TGC scores on Dev set for each 5…

**Figure 7.** Action Explanations. Code execution failures…

**Figure 8.** Analysis of skill usage patterns across differ…

**Figure 9.** Prompt for Skill Library Agent.

**Figure 10.** Task Execution Examples for Baseline GRPO

**Figure 11.** Task Execution Examples for Skill Library Agent

**Figure 12.** Task Execution Examples for SFT

**Figure 13.** Task Execution Examples for SAGE

Original abstract

Large Language Model (LLM)-based agents have demonstrated remarkable capabilities in complex reasoning and multi-turn interactions but struggle to continuously improve and adapt when deployed in new environments. One promising approach is implementing skill libraries that allow agents to learn, validate, and apply new skills. However, current skill library approaches rely primarily on LLM prompting, making consistent skill library implementation challenging. To overcome these challenges, we propose a Reinforcement Learning (RL)-based approach to enhance agents' self-improvement capabilities with a skill library. Specifically, we introduce Skill Augmented GRPO for self-Evolution (SAGE), a novel RL framework that systematically incorporates skills into learning. The framework's key component, Sequential Rollout, iteratively deploys agents across a chain of similar tasks for each rollout. As agents navigate through the task chain, skills generated from previous tasks accumulate in the library and become available for subsequent tasks. Additionally, the framework enhances skill generation and utilization through a Skill-integrated Reward that complements the original outcome-based rewards. Experimental results on AppWorld demonstrate that SAGE, when applied to supervised-finetuned model with expert experience, achieves 8.9% higher Scenario Goal Completion while requiring 26% fewer interaction steps and generating 59% fewer tokens, substantially outperforming existing approaches in both accuracy and efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SAGE adds a concrete RL procedure for accumulating skills across task chains with an auxiliary reward in GRPO, but the efficiency claims rest on unvalidated skills that could compound errors.

The main thing to take away is that this paper describes a specific way to train agents with a growing skill library using reinforcement learning, but the results depend on skills staying accurate without any mentioned validation step.

What is new here is the Sequential Rollout mechanism, which chains similar tasks together so skills build up, combined with a Skill-integrated Reward added to the GRPO training. This gives a practical training loop that the abstract says goes beyond just prompting for skills, and it isn't covered in the prior work they reference.

The paper does well at laying out the framework clearly and reporting measurable improvements on the AppWorld benchmark. When applied to a supervised fine-tuned model, SAGE gets 8.9 percent higher scenario goal completion, uses 26 percent fewer steps, and generates 59 percent fewer tokens than the baselines. Those numbers point to real gains in both accuracy and efficiency for self-improving agents.

The soft spots center on the missing details around skill quality. The abstract does not describe how generated skills are checked for accuracy or relevance before being stored and reused on later tasks in the chain. This leaves room for early errors to compound, which could explain some of the efficiency wins without true self-improvement. There is also no information on the number of experimental runs or any statistical tests, making it hard to assess how reliable the reported gains are.

This work is for researchers working on LLM-based agents that need to adapt and improve in ongoing deployments, especially those using RL methods like GRPO. Someone looking for concrete training procedures on agent benchmarks would get value from the rollout and reward design. It deserves a serious referee because the idea is specific and tied to a real benchmark, even though the current evidence is limited. I would recommend putting it through peer review to get feedback on the validation process and reproducibility.

Referee Report

2 major / 2 minor

Summary. The paper proposes SAGE, an RL framework for self-improving LLM agents that augments a skill library via Sequential Rollout (iterative deployment across chains of similar tasks so that skills accumulate) and a Skill-integrated Reward. When applied to a supervised-finetuned model on AppWorld, SAGE reports 8.9% higher Scenario Goal Completion, 26% fewer interaction steps, and 59% fewer tokens than baselines.

Significance. If the empirical gains are robust and the skill library remains accurate across rollouts, the work offers a concrete RL mechanism for continuous agent adaptation that moves beyond pure prompting-based skill libraries, with measurable efficiency and accuracy benefits on a realistic benchmark.

major comments (2)

[§3.2, Sequential Rollout] The description of skill generation and library insertion provides no explicit validation, filtering, or correction step before reuse on subsequent tasks. Because the reported gains rest on accumulated skills improving later performance, the absence of such a mechanism leaves open the possibility of compounding errors, directly undermining the central self-improvement claim.
[Experimental Results] The abstract and results section report numeric improvements (8.9% Scenario Goal Completion, 26% fewer steps, 59% fewer tokens) without stating the number of runs, statistical tests, variance, or exact baseline implementations, and without describing how skill validity is checked before library insertion. These omissions leave the primary empirical claim only partially supported.

minor comments (2)

[§3.3] Notation for the Skill-integrated Reward should be defined explicitly with an equation rather than described only in prose.
[§4] The paper should clarify whether the expert experience used in the SFT baseline is the same data source as the skills generated during SAGE rollouts.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below with point-by-point responses and indicate where revisions have been made to strengthen the presentation of our method and results.

Referee: [§3.2, Sequential Rollout] The description of skill generation and library insertion provides no explicit validation, filtering, or correction step before reuse on subsequent tasks. Because the reported gains rest on accumulated skills improving later performance, the absence of such a mechanism leaves open the possibility of compounding errors, directly undermining the central self-improvement claim.

Authors: We agree that the original §3.2 description did not explicitly articulate a validation or filtering mechanism prior to library insertion. The Skill-integrated Reward within the GRPO objective provides an implicit signal that favors skills leading to higher cumulative returns, thereby reducing the likelihood of propagating ineffective skills across the task chain. Nevertheless, to directly address the concern about compounding errors, we have revised §3.2 to include an explicit skill validation step: after generation, a skill is inserted into the library only if it contributes to successful task completion in the current rollout (measured by the outcome reward exceeding a threshold derived from the baseline). This addition clarifies how the framework mitigates error accumulation while preserving the self-improvement loop.

Revision: yes
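The gating rule described in this response can be sketched as follows. This is a minimal illustration, assuming a dictionary-backed library and a scalar threshold; `maybe_insert_skill` and the threshold value are hypothetical, and the response does not say how the threshold is derived from the baseline.

```python
from typing import Dict, Optional, Tuple


def maybe_insert_skill(
    library: Dict[str, str],
    skill: Optional[Tuple[str, str]],
    outcome_reward: float,
    threshold: float,
) -> bool:
    """Store the skill only when the rollout's outcome reward beats the threshold."""
    if skill is None or outcome_reward <= threshold:
        return False  # nothing generated, or the task did not succeed well enough
    name, code = skill
    library[name] = code
    return True


library: Dict[str, str] = {}
candidate = ("book_flight", "def book_flight(origin, dest): ...")  # placeholder skill
rejected = maybe_insert_skill(library, candidate, outcome_reward=0.2, threshold=0.5)
accepted = maybe_insert_skill(library, candidate, outcome_reward=1.0, threshold=0.5)
```

Note that a gate of this kind checks only that the task succeeded while the skill was generated, not that the skill itself is correct on later inputs.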

Referee: [Experimental Results] The abstract and results section report numeric improvements (8.9% Scenario Goal Completion, 26% fewer steps, 59% fewer tokens) without stating the number of runs, statistical tests, variance, or exact baseline implementations, and without describing how skill validity is checked before library insertion. These omissions leave the primary empirical claim only partially supported.

Authors: We acknowledge that the reported metrics lacked accompanying details on experimental rigor. In the revised manuscript we now state that all results are averaged over 5 independent runs with different random seeds, include standard deviations, and report p-values from paired t-tests against each baseline. We have also expanded the baseline descriptions to specify the exact supervised fine-tuning checkpoints and prompting configurations used. As noted in our response to the §3.2 comment, the skill validity check is now explicitly described in the updated method section. These additions provide the necessary statistical and procedural transparency to support the empirical claims.

Revision: yes
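For readers who want to see what the promised reporting involves, here is a small standard-library sketch of a paired comparison over five seeds. The function name is hypothetical and the scores are made-up placeholders, not numbers from the paper. It computes the paired t statistic only; turning that into a p-value needs a Student's t distribution, for example via SciPy's `scipy.stats.ttest_rel`.

```python
import math
from statistics import mean, stdev
from typing import List, Tuple


def paired_t_statistic(method: List[float], baseline: List[float]) -> Tuple[float, float, float]:
    """Return (mean difference, sample std of differences, paired t statistic)."""
    if len(method) != len(baseline) or len(method) < 2:
        raise ValueError("need equally sized samples with at least two pairs")
    diffs = [m - b for m, b in zip(method, baseline)]
    mean_diff = mean(diffs)
    sd_diff = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return mean_diff, sd_diff, t_stat


# Placeholder scores for 5 seeds, paired by seed (not results from the paper).
method_scores = [2.0, 4.0, 6.0, 5.0, 5.0]
baseline_scores = [1.0, 2.0, 3.0, 3.0, 3.0]
mean_diff, sd_diff, t_stat = paired_t_statistic(method_scores, baseline_scores)
```

With five runs the test has four degrees of freedom, so the t statistic has to be fairly large before a difference counts as significant.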

Circularity Check

0 steps flagged

No significant circularity in derivation or claims

The paper introduces SAGE as an RL framework using Sequential Rollout to accumulate skills across task chains and a Skill-integrated Reward to guide generation and use. Central claims consist of experimental metrics (8.9% higher goal completion, 26% fewer steps, 59% fewer tokens) measured on held-out AppWorld scenarios after applying the method to a supervised-finetuned model. These outcomes are externally evaluated quantities that do not reduce by the paper's own equations or definitions to fitted parameters, self-citations, or inputs by construction. No self-definitional steps, fitted-input predictions, or load-bearing self-citation chains appear in the described methodology. The framework is presented as a novel combination of existing RL ideas with skill libraries, and results are reported as independent empirical evidence rather than derived tautologies.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The framework rests on the assumption that LLM-generated skills can be reliably stored and retrieved without external verification and that the sequential rollout structure does not introduce distribution shift that invalidates the learned policy.

invented entities (2)

Skill library · no independent evidence
purpose: Persistent store of reusable skills generated during rollouts
Core component introduced by the paper; no independent evidence of correctness or completeness is provided.

Sequential Rollout · no independent evidence
purpose: Training procedure that chains similar tasks so skills accumulate within one episode
Novel procedural element whose benefit is demonstrated only through the reported AppWorld numbers.

pith-pipeline@v0.9.0 · 5557 in / 1293 out tokens · 77532 ms · 2026-05-17T20:01:16.035981+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "Sequential Rollout iteratively deploys agents across a chain of similar tasks... skills generated from previous tasks accumulate in the library"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt · unclear
Paper passage: "Skill-integrated Reward... $R_1 = r_1 + \mathbb{1}[r_1 = 1] \cdot \mathbb{1}[r_2 = 1] \cdot \mathbb{1}_{\text{skill}}(q_2 \mid q_1)$"
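
One way to read the reward quoted above, as a hedged Python sketch: it assumes `r1` and `r2` are binary outcome rewards for two consecutive tasks `q1` and `q2` in a chain, and that the last indicator is 1 when a skill produced during `q1` is used while solving `q2`. That interpretation and the function name are assumptions; the quoted passage is elided and does not define the terms.

```python
def skill_integrated_reward(r1: int, r2: int, skill_from_q1_used_in_q2: bool) -> int:
    """Outcome reward for q1 plus a bonus when its skill helps q2 succeed."""
    # The bonus fires only if both tasks succeed AND the skill was reused.
    bonus = int(r1 == 1) * int(r2 == 1) * int(skill_from_q1_used_in_q2)
    return r1 + bonus


both_succeed_with_reuse = skill_integrated_reward(1, 1, True)
second_task_fails = skill_integrated_reward(1, 0, True)
no_reuse = skill_integrated_reward(1, 1, False)
first_task_fails = skill_integrated_reward(0, 1, True)
```

Under this reading the bonus rewards the earlier task for writing a skill that a later successful task actually uses, rather than for writing skills as such.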

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SEVerA: Verified Synthesis of Self-Evolving Agents
cs.LG · 2026-03 · unverdicted · novelty 8.0
SEVerA uses Formally Guarded Generative Models and a three-stage Search-Verification-Learning process to synthesize self-evolving agents that satisfy hard formal constraints while improving task performance.

SkillOps: Managing LLM Agent Skill Libraries as Self-Maintaining Software Ecosystems
cs.SE · 2026-05 · unverdicted · novelty 7.0
SkillOps maintains LLM skill libraries via Skill Contracts and ecosystem graphs, raising ALFWorld task success to 79.5% as a standalone agent and improving retrieval baselines by up to 2.9 points with near-zero librar...

OLIVIA: Online Learning via Inference-time Action Adaptation for Decision Making in LLM ReAct Agents
cs.AI · 2026-05 · unverdicted · novelty 7.0
OLIVIA treats LLM agent action selection as a contextual linear bandit over frozen hidden states and applies UCB exploration to adapt online, yielding consistent gains over static ReAct and prompt-based baselines on f...

SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks
cs.AI · 2026-05 · unverdicted · novelty 7.0
SearchSkill improves exact match scores and retrieval efficiency on open-domain QA by conditioning LLM actions on skills from an evolving SkillBank updated from failure patterns via two-stage SFT.

Skill-CMIB: Multimodal Agent Skill for Consistent Action via Conditional Multimodal Information Bottleneck
cs.LG · 2026-05 · unverdicted · novelty 7.0
CMIB uses a conditional multimodal information bottleneck to create reusable agent skills that separate verbalizable text content from predictive perceptual residuals, improving execution stability.

Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks
cs.AI · 2026-04 · unverdicted · novelty 7.0
COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
cs.LG · 2026-04 · unverdicted · novelty 7.0
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

SKILLFOUNDRY: Building Self-Evolving Agent Skill Libraries from Heterogeneous Scientific Resources
cs.AI · 2026-04 · unverdicted · novelty 7.0
SkillFoundry mines heterogeneous scientific resources into a self-evolving library of validated agent skills, with 71.1% novelty versus prior libraries and measurable gains on coding benchmarks plus two genomics tasks.

Skill-R1: Agent Skill Evolution via Reinforcement Learning
cs.LG · 2026-05 · unverdicted · novelty 6.0
Skill-R1 applies bi-level group-relative policy optimization to evolve skills recurrently from verified outcomes, yielding gains over baselines on multi-step tasks.

SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks
cs.AI · 2026-05 · unverdicted · novelty 6.0
SearchSkill introduces an evolving SkillBank and two-stage SFT to make LLM search query planning explicit via skill selection, improving exact match on QA benchmarks and retrieval behavior.

SkillMaster: Toward Autonomous Skill Mastery in LLM Agents
cs.AI · 2026-05 · unverdicted · novelty 6.0
SkillMaster is a training framework that lets LLM agents autonomously propose, update, and apply skills, yielding 8.8% and 9.3% higher success rates on ALFWorld and WebShop than prior methods.

SkillMaster: Toward Autonomous Skill Mastery in LLM Agents
cs.AI · 2026-05 · unverdicted · novelty 6.0
SkillMaster enables LLM agents to autonomously develop skills via trajectory review, counterfactual evaluation, and DualAdv-GRPO training, boosting success rates by 8.8% on ALFWorld and 9.3% on WebShop.

Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
cs.AI · 2026-05 · unverdicted · novelty 6.0
Skill1 trains one policy to jointly evolve skill query generation, re-ranking, task solving, and distillation from a single task-success signal, with low-frequency trends crediting selection and high-frequency variati...

SkillGraph: Self-Evolving Multi-Agent Collaboration with Multimodal Graph Topology
cs.AI · 2026-04 · unverdicted · novelty 6.0
SkillGraph jointly evolves agent skills and collaboration topologies in multi-agent vision-language systems using a multimodal graph transformer and a skill designer, yielding consistent performance gains on benchmarks.

Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
cs.AI · 2026-05 · unverdicted · novelty 5.0
Skill1 trains a single RL policy to co-evolve skill selection, utilization, and distillation in language model agents from one task-outcome reward, using low-frequency trends to credit selection and high-frequency var...

Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
cs.AI · 2026-05 · unverdicted · novelty 5.0
Skill1 co-evolves skill selection, utilization, and distillation inside a single policy using only task-outcome reward, with low-frequency trends crediting selection and high-frequency variation crediting distillation...

Web2BigTable: A Bi-Level Multi-Agent LLM System for Internet-Scale Information Search and Extraction
cs.AI · 2026-04 · unverdicted · novelty 5.0
Web2BigTable introduces a bi-level multi-agent system that achieves new state-of-the-art results on wide-coverage and deep web-to-table search benchmarks through orchestration, coordination, and closed-loop reflection.

Bilevel Optimization of Agent Skills via Monte Carlo Tree Search
cs.AI · 2026-04 · unverdicted · novelty 5.0
Bilevel optimization with outer-loop MCTS for skill structure and inner-loop LLM refinement improves agent accuracy on an operations-research question-answering dataset.

A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
cs.IR · 2026-05 · unverdicted · novelty 4.0
The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.

Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward
cs.MA · 2026-02 · unverdicted · novelty 4.0
The paper surveys agent skills for LLMs across architecture, acquisition, deployment, and security, proposing a four-tier Skill Trust and Lifecycle Governance Framework to address vulnerabilities in community skills.
