
arxiv: 2606.01619 · v2 · pith:N5WAHF7Inew · submitted 2026-06-01 · 💻 cs.AI · cs.LG · stat.ML

ReSkill: Reconciling Skill Creation with Policy Optimization in Agentic RL

Pith reviewed 2026-06-28 14:47 UTC · model grok-4.3

classification 💻 cs.AI · cs.LG · stat.ML
keywords agentic RL · skill creation · policy optimization · GRPO · Thompson Sampling · LLM agents · reusable skills · skill lifecycle

The pith

ReSkill embeds skill creation inside policy optimization so skills evolve with the agent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that skill creation can be integrated directly into the ongoing policy optimization process for LLM agents rather than handled separately. This matters to a sympathetic reader because decoupling the two risks creating skills that conflict with the improving policy or fail to generalize across tasks. If the integration works, agents would accumulate reusable conditional strategies automatically from environment rewards, leading to stronger performance especially on tasks not encountered during training. The approach exploits an existing group-wise structure in GRPO to add the necessary mechanisms at low cost.

Core claim

ReSkill is an RL-in-the-loop skill creation framework that reconciles skill evolution with policy learning. It exploits the group-wise structure of GRPO to embed three mechanisms with marginal overhead: an assertion-driven skill creator that diagnoses failures and proposes conditional, trigger-based revisions; within-group rollout sampling that compares skill versions during learning; and Thompson Sampling with adaptive discounting that balances exploration and exploitation in skill selection. Across domains, this produces consistent outperformance over memory and skill-based RL baselines, with the largest gains on unseen tasks, while skills are created, tested, refined, and pruned automatically.

What carries the argument

The group-wise structure of GRPO that naturally embeds the assertion-driven creator, within-group sampling, and Thompson Sampling with adaptive discounting.
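
To make the group-wise idea concrete, here is a minimal sketch of how one rollout group could serve both purposes at once: the same rewards feed the group-relative advantages and a per-version success tally. It assumes binary rewards and a balanced split of versions within the group. The helper names (`grpo_advantages`, `assign_versions`, `version_tallies`) and the version labels are hypothetical, and the use of the population standard deviation is a choice made here, not something the paper specifies.

```python
import random
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Standardize rewards within one rollout group (group-relative advantages)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]


def assign_versions(group_size, versions, rng):
    """Assumed scheme: split the group evenly across skill versions, then shuffle."""
    slots = [versions[i % len(versions)] for i in range(group_size)]
    rng.shuffle(slots)
    return slots


def version_tallies(assignments, rewards):
    """Per-version (successes, trials) from the same group, assuming 0/1 rewards."""
    tallies = {}
    for version, reward in zip(assignments, rewards):
        successes, trials = tallies.get(version, (0, 0))
        tallies[version] = (successes + int(reward > 0), trials + 1)
    return tallies


rng = random.Random(0)
assignments = assign_versions(8, ["skill_v1", "skill_v2"], rng)
# Toy rewards for illustration only: pretend every skill_v2 rollout succeeds.
rewards = [1.0 if v == "skill_v2" else 0.0 for v in assignments]
advantages = grpo_advantages(rewards)
tallies = version_tallies(assignments, rewards)
```

In this sketch the tally is a read-only view of rewards the group already produced, which is one way to read the "marginal overhead" claim.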

If this is right

  • Skills are created automatically from diagnoses of past failures.
  • Within-group sampling allows direct comparison of skill versions during rollouts.
  • Thompson Sampling with adaptive discounting selects skill versions while the policy changes.
  • ReSkill outperforms existing memory and skill-based RL methods, with largest gains on unseen tasks.
  • Skills are created, tested, refined, and pruned as the policy improves.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same embedding idea could extend to other group-based RL methods if they support controlled version comparison.
  • Agents might sustain longer task horizons by growing a library of trigger-based skills without external memory modules.
  • The co-evolution dynamic could be tested by measuring how often pruned skills are later re-proposed after policy shifts.
  • Environments with greater non-stationarity would stress-test whether adaptive discounting prevents premature skill fixation.

Load-bearing premise

The group-wise structure of GRPO can embed the three mechanisms without introducing conflicts as the policy evolves.
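
The premise can be stated a little more precisely in notation. The block below is a sketch under assumptions, since this page shows no equations from the paper. The first line is the commonly used group-normalized advantage for a group of $G$ rollouts with rewards $r_1, \dots, r_G$. The discount weight $w_t$ follows the form given in the Figure 8 caption. The Beta pseudo-count update around it, with $m_t$ successes out of $n_t$ rollouts for one skill version, is an assumed form for illustration.

```latex
% Group-relative advantage for rollout i (standard GRPO-style normalization)
\hat{A}_i = \frac{r_i - \mu_G}{\sigma_G + \epsilon},
\qquad \mu_G = \frac{1}{G}\sum_{j=1}^{G} r_j,
\qquad \sigma_G^2 = \frac{1}{G}\sum_{j=1}^{G} (r_j - \mu_G)^2

% Discounted Beta update for one skill version (assumed form; w_t as in Figure 8)
w_t = \frac{1}{1 + n_t/\hat{M}}, \qquad
\alpha_t = w_t\,\alpha_{t-1} + m_t, \qquad
\beta_t = w_t\,\beta_{t-1} + (n_t - m_t)
```

Read this way, "no conflict" would mean that $(m_t, n_t)$ are tallied from the same rewards that define $\hat{A}_i$, but $\alpha_t$ and $\beta_t$ never enter $\hat{A}_i$ directly. The chosen skill version influences the policy gradient only through the rollouts it conditions.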

What would settle it

A controlled experiment on one of the paper's benchmarks in which enabling the three embedded mechanisms causes average policy return to fall below the no-skill-creation baseline.

Figures

Figures reproduced from arXiv: 2606.01619 by Bernie Wang, Boran Han, Haotian Lin, Haoyang Fang, Matthew Reimherr, Runze Li, Wei Zhu, Xuan Zhu, Zelin He.

Figure 1: (a) Inspired by Anthropic’s human-in-the-loop Skill Creator, …
Figure 2: Overview of ReSkill. (1) RL training with within-group skill testing (§3.1). (2) RL-in-the-loop skill creation (§3.2). (3) RL-guided skill evolution with Thompson Sampling (§3.3).
Figure 4: Training dynamics on held-out validation subsets.
Figure 3: Performance on additional benchmarks: ScienceWorld (electricity tasks), InterCode-SQL, and WANDS. The gap between ReSkill and baselines widens consistently on harder or out-of-domain tasks.
Figure 6: Cost analysis on ALFWorld (4B), reported as ratios vs. GRPO. Left two: training step time and context length. Right two: inference time per episode and context length. ReSkill (star) achieves the highest accuracy ratio while maintaining competitive overhead.
Figure 5: Test-time cross-domain adaptation from ALFWorld to ScienceWorld. ReSkill rapidly adapts skills to a new domain while baselines remain near zero. Right panel shows accepted and rejected skill operations during adaptation.
Figure 7: Skill-policy co-evolution on ALFWorld (left) and Search (right). Training curves are shown with colored stage bands alongside key skill operations: + add (new skill created) and − delete (internalized skill pruned).
Figure 8: Per-arm discount weights $w_t = 1/(1 + n_t/\hat{M})$ during training on Search. Solid lines use the online $\hat{M}_k$; dotted lines use the final $\hat{M}$. The two arms’ weights are anti-correlated: Thompson Sampling allocates more episodes to the favored arm, lowering its $w_t$ (more forgetting), while the minority arm retains more history. Green/red vertical lines indicate accepted/rejected evolution cycles.
Figure 9: Sensitivity to evolution frequency $K_e$ and skill bank size on ALFWorld and Search. Default settings: $K_e = 5$, bank size 8.
Abstract

Agentic reinforcement learning (RL) enables LLM agents to improve continuously from environment rewards, yet the resulting policies do not systematically accumulate reusable strategies that generalize across tasks. Modular skills can provide such reusable strategies, yet existing skill-augmented RL methods decouple skill creation from policy optimization, risking adopting skills that conflict with the evolving policy. Inspired by Anthropic's Skill Creator, we introduce ReSkill, an RL-in-the-loop skill creation framework that reconciles skill evolution with policy learning. ReSkill exploits the group-wise structure of GRPO to naturally embed three mechanisms with only marginal additional overhead: (1) an assertion-driven skill creator that diagnoses failures from past experience and proposes conditional, trigger-based skill revisions; (2) within-group rollout sampling that enables controlled comparison of skill versions, capturing which version best supports the policy's ongoing learning; and (3) Thompson Sampling with adaptive discounting to balance exploration and exploitation in skill version selection as the policy evolves. Across several domains, ReSkill consistently outperforms existing memory and skill-based RL methods, with the largest gains on unseen tasks. Analysis of the skill lifecycle shows skills being automatically created, tested, refined, and pruned as the policy improves, demonstrating reconciled skill-policy co-evolution.
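
Mechanism (3) can be illustrated with a short sketch of Thompson Sampling over skill versions using discounted Beta posteriors. The discount weight `w = 1 / (1 + n / M)` follows the form shown in the Figure 8 caption. Everything else is an assumption for illustration: the class name `DiscountedBetaArm`, the Beta(1, 1) starting counts, and the fixed `memory` constant standing in for the estimated M̂ (the caption indicates the paper estimates it online).

```python
import random


class DiscountedBetaArm:
    """One skill version with a Beta posterior whose history is discounted."""

    def __init__(self, memory=20.0):
        self.alpha = 1.0  # pseudo-count of successes
        self.beta = 1.0   # pseudo-count of failures
        self.memory = memory  # hypothetical fixed stand-in for the estimated M-hat

    def update(self, successes, trials):
        # More new rollouts means a smaller weight on old evidence (more forgetting).
        w = 1.0 / (1.0 + trials / self.memory)
        self.alpha = w * self.alpha + successes
        self.beta = w * self.beta + (trials - successes)
        return w

    def sample(self, rng):
        return rng.betavariate(self.alpha, self.beta)


def select_version(arms, rng):
    """Thompson Sampling: draw once from each posterior and pick the largest draw."""
    draws = {name: arm.sample(rng) for name, arm in arms.items()}
    return max(draws, key=draws.get)


rng = random.Random(0)
arms = {"skill_v1": DiscountedBetaArm(), "skill_v2": DiscountedBetaArm()}
w = arms["skill_v2"].update(successes=4, trials=4)
arms["skill_v1"].update(successes=0, trials=4)
chosen = select_version(arms, rng)
```

Because the weight shrinks as an arm receives more rollouts, the favored version forgets faster than the minority version, which matches the anti-correlated weights described in the Figure 8 caption.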

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces ReSkill, an RL-in-the-loop skill creation framework for agentic RL that reconciles skill evolution with policy optimization. It exploits GRPO's group-wise rollouts to embed three mechanisms—an assertion-driven skill creator that proposes conditional revisions from failure diagnoses, within-group sampling for controlled version comparison, and Thompson Sampling with adaptive discounting for version selection—claiming only marginal overhead. The work reports consistent outperformance over memory and skill-based RL baselines, with largest gains on unseen tasks, alongside automatic skill creation, testing, refinement, and pruning as the policy improves.

Significance. If the claimed integration of the three mechanisms into GRPO proceeds without selection bias or policy-skill conflicts, the result would address a central limitation of prior skill-augmented RL by enabling systematic accumulation of reusable, generalizable strategies. The design choice to leverage an existing group-wise structure rather than introduce separate skill and policy loops is a conceptual strength that could reduce overhead if empirically validated.

major comments (2)
  1. [Abstract] The central claim that the three mechanisms 'naturally embed' into GRPO's group-wise structure 'with only marginal additional overhead' and 'without introducing conflicts' is load-bearing for the reconciled co-evolution result, yet the text provides no derivation or analysis showing that (a) failure assertions can be computed from within-group trajectories without altering advantage estimates, (b) version sampling remains unbiased during ongoing policy updates, or (c) the adaptive discount in Thompson Sampling does not interact with GRPO group normalization.
  2. [Abstract] The reported largest gains on unseen tasks are presented as evidence that skills generalize via reconciled co-evolution, but without explicit checks that within-group version comparison avoids favoring revisions that merely exploit transient policy states, the gains could be artifacts of the particular GRPO implementation rather than a general property of the framework.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate where we will revise the manuscript to strengthen the presentation of the framework's properties.

  1. Referee: [Abstract] The central claim that the three mechanisms 'naturally embed' into GRPO's group-wise structure 'with only marginal additional overhead' and 'without introducing conflicts' is load-bearing for the reconciled co-evolution result, yet the text provides no derivation or analysis showing that (a) failure assertions can be computed from within-group trajectories without altering advantage estimates, (b) version sampling remains unbiased during ongoing policy updates, or (c) the adaptive discount in Thompson Sampling does not interact with GRPO group normalization.

    Authors: We agree that the abstract's claims would be strengthened by explicit analysis. The design computes assertions only after full group trajectories are collected, leaving GRPO advantage estimates unchanged; version sampling draws from a separate historical performance buffer updated between policy steps; and the adaptive discount modulates only the prior over skill versions, independent of intra-group normalization. We will add a short formal subsection in Section 3 deriving these separation properties. revision: yes

  2. Referee: [Abstract] The reported largest gains on unseen tasks are presented as evidence that skills generalize via reconciled co-evolution, but without explicit checks that within-group version comparison avoids favoring revisions that merely exploit transient policy states, the gains could be artifacts of the particular GRPO implementation rather than a general property of the framework.

    Authors: The within-group design ensures that all skill versions are evaluated under identical policy parameters and environment conditions for that rollout batch, which directly controls for transient state effects. The larger gains on unseen tasks are therefore measured under this controlled comparison. We will add a paragraph and supplementary figure in Section 4 clarifying this control and showing version-selection stability across consecutive policy updates. revision: partial
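
The "version-selection stability" promised in the second response could be measured in several ways. The sketch below shows one simple option: for each policy update, find the version with the highest within-group success rate, then report the fraction of consecutive updates in which that preferred version is unchanged. The function names, the tally format, and the toy history are hypothetical and are not taken from the paper.

```python
def preferred_version(tallies):
    """tallies maps version -> (successes, trials); return the best success rate."""
    def rate(version):
        successes, trials = tallies[version]
        return successes / trials if trials else 0.0
    return max(tallies, key=rate)


def selection_stability(history):
    """Fraction of consecutive updates whose preferred version is unchanged."""
    prefs = [preferred_version(t) for t in history]
    if len(prefs) < 2:
        return 1.0
    unchanged = sum(a == b for a, b in zip(prefs, prefs[1:]))
    return unchanged / (len(prefs) - 1)


# Toy per-update tallies, for illustration only.
history = [
    {"skill_v1": (1, 4), "skill_v2": (3, 4)},
    {"skill_v1": (2, 4), "skill_v2": (3, 4)},
    {"skill_v1": (3, 4), "skill_v2": (2, 4)},
    {"skill_v1": (1, 4), "skill_v2": (4, 4)},
]
stability = selection_stability(history)
```

A value near 1 would suggest the within-group comparison is not simply tracking transient policy states. A low value would support the referee's concern.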

Circularity Check

0 steps flagged

No circularity: framework leverages external GRPO structure without self-referential reductions

The paper proposes ReSkill as an integration that exploits the pre-existing group-wise structure of GRPO to embed assertion-driven creation, within-group sampling, and Thompson Sampling. No equations, fitted parameters, or derivations are shown that reduce by construction to their own inputs. The approach is explicitly inspired by external Anthropic work rather than self-citation chains, and performance claims rest on empirical results across domains rather than any uniqueness theorem or ansatz smuggled via prior author work. The derivation chain is therefore self-contained as a proposed engineering reconciliation rather than a tautological re-labeling or statistical forcing.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are stated. The framework assumes GRPO's group structure and Thompson Sampling are available as background.

pith-pipeline@v0.9.1-grok · 5774 in / 1058 out tokens · 19934 ms · 2026-06-28T14:47:46.713180+00:00 · methodology
