SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization

Chengcheng Han; Jinyang Wu; Jun Xiao; Qi Gu; Weiming Lu; Xunliang Cai; Yongliang Shen; Yueting Zhuang; Zhengxi Lu; Zhiyuan Yao

arxiv: 2604.02268 · v2 · pith:PNWKDRK6new · submitted 2026-04-02 · 💻 cs.LG

Pith reviewed 2026-05-19 17:37 UTC · model grok-4.3

keywords: skill internalization, in-context RL, LLM agents, curriculum learning, zero-shot agents, reinforcement learning, agentic tasks

The pith

A curriculum of progressively withdrawing skill context during reinforcement learning lets agents internalize procedural knowledge into their parameters for zero-shot task completion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes SKILL0, a framework that trains language model agents to internalize skills rather than retrieve them during use. It starts training with full skill information provided in context and gradually removes it using a dynamic selection process based on whether the current policy still benefits from each skill file. This results in agents that can perform tasks autonomously without any external skill input at inference time. Readers would care if this holds because it reduces token usage and noise from retrieval while potentially improving overall performance on agent benchmarks.

Core claim

SKILL0 introduces an in-context reinforcement learning setup where skills are grouped by category and combined with interaction history into compact visual context. A Dynamic Curriculum then assesses each skill file's helpfulness to the current policy and retains only useful ones within a budget that decreases linearly, continuing until the agent functions without any skill context.

What carries the argument

The Dynamic Curriculum mechanism, which identifies on-policy helpfulness of skill files and manages their progressive withdrawal from the training context.
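
The selection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `select_skills`, the "keep only positive gains, ranked by gain" rule, and the toy scores are all assumptions made for the example. The only part taken from the paper's description is that helpfulness compares performance with and without a skill file, and that retention is capped by a budget.

```python
from typing import Callable, Dict, List


def select_skills(
    skill_files: List[str],
    budget: int,
    score_with: Callable[[str], float],
    score_without: float,
) -> List[str]:
    """Keep at most `budget` skill files that still help the current policy."""
    # Helpfulness = performance with the file in context minus performance without it.
    helpfulness: Dict[str, float] = {
        name: score_with(name) - score_without for name in skill_files
    }
    # Assumption: only files with a strictly positive gain are candidates.
    helpful = [name for name in skill_files if helpfulness[name] > 0]
    # Rank by gain, largest first; the sort is stable, so ties keep input order.
    helpful.sort(key=lambda name: helpfulness[name], reverse=True)
    return helpful[: max(budget, 0)]


# Toy, made-up validation scores (not results from the paper).
toy_scores = {"a.md": 0.7, "b.md": 0.5, "c.md": 0.65}
kept = select_skills(list(toy_scores), 2, toy_scores.__getitem__, 0.6)
print(kept)  # ['a.md', 'c.md']
```

In this toy run `b.md` is dropped because the policy scores lower with it than without it, regardless of how much budget remains.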

If this is right

The internalized agent outperforms the standard RL baseline by +9.7% on ALFWorld, +6.6% on Search-QA, and +10.1% on WebShop.
Inference-time context usage stays below 0.5k tokens per step even as skills are no longer provided.
The agent learns tool invocation and multi-turn completion through the curriculum without relying on runtime retrieval.
Full zero-shot operation becomes possible after the curriculum completes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Internalized skills could compound over multiple training phases to handle increasingly complex tasks.
The method might apply to internalizing other types of knowledge, such as facts or strategies, in agent systems.
Checking performance on entirely new tasks after internalization would test if the skills have become general capabilities rather than task-specific memorization.
This could lower the need for large context windows in deployed agents.

Load-bearing premise

That the progressive withdrawal of context during training causes the model to truly encode the skills in its weights instead of learning to perform without them only because of the specific training distribution.

What would settle it

Running the trained agent on the benchmark tasks with all skill files completely removed from any context and observing whether success rates remain above the standard RL baseline or drop back to it.

Figures

Figures reproduced from arXiv: 2604.02268 by Chengcheng Han, Jinyang Wu, Jun Xiao, Qi Gu, Weiming Lu, Xunliang Cai, Yongliang Shen, Yueting Zhuang, Zhengxi Lu, Zhiyuan Yao.

**Figure 1.** Comparison of (a) Skill Augmentation methods and (b) our Skill Internalization method. [PITH_FULL_IMAGE:figures/full_fig_p001_1.png]

Title-page notes captured with Figure 1: ∗ work done during internship at Meituan; † corresponding author.

**Figure 2.** Overview of SKILL0. (a) Relevance-Driven Skill Grouping; (b) In-Context Reinforcement Learning with skill-enhanced agent loop; (c) Dynamic curriculum learning during training process.

**Figure 3.** Comparison of training dynamics with AgentOCR on Qwen2.5-VL-3B. (Axes: Steps vs. Reward; series: AgentOCR, Skill0.) [PITH_FULL_IMAGE:figures/full_fig_p007_3.png]

**Figure 5.** Training Dynamics Comparison. (a) Validation performance of SKILL0 (OCR) with and without skill augmentation, evaluated every 10 training steps. (b) Performance comparison between SKILL0 and AgentOCR, both evaluated without skill augmentation. (c) Performance comparison of SKILL0 (OCR) against GRPO (Text) and SkillRL (Text), all evaluated without skill augmentation. [PITH_FULL_IMAGE:figures/full_fig_p007_5.png]

**Figure 6.** Training dynamics of helpfulness, reported as Δ_k for each sub-task k. [PITH_FULL_IMAGE:figures/full_fig_p008_6.png]

**Figure 8.** Ablations of skill budget during training process. [PITH_FULL_IMAGE:figures/full_fig_p008_8.png]

**Figure 9.** Training dynamics of S [PITH_FULL_IMAGE:figures/full_fig_p016_9.png]

**Figure 10.** Training dynamics of SKILL0 on Qwen2.5VL-3B, with SearchQA sub-tasks (split by skill categories) accuracy reported. [PITH_FULL_IMAGE:figures/full_fig_p016_10.png]

**Figure 11.** Prompt template used by SKILL0 for the ALFWorld embodied task environment.

**Figure 12.** Prompt template used by SKILL0 for the Search-based QA task environment. Excerpt: "You are an expert agent tasked with answering the given question step-by-step. {skill_context} Your question: {task_description}. Prior to this step, you have already taken {step_count} step(s). The image contains the full history: • Past queries are inside <search>...</search> • Past results are…" [PITH_FULL_IMAGE:figures/full_fig_p019_12.png]
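
The Search-based QA prompt quoted in the captions above exposes three placeholders: `{skill_context}`, `{task_description}`, and `{step_count}`. The sketch below shows one way such a template could be filled, including the zero-shot case where no skill content is injected. The function name `build_prompt`, the newline joining of skill files, and the omission of the rest of the template (which is truncated in the source) are assumptions for illustration only.

```python
from typing import Sequence

# Only the opening of the template is reproduced; the remainder is truncated in the source.
HEADER = "You are an expert agent tasked with answering the given question step-by-step."
BODY = (
    "Your question: {task_description}. "
    "Prior to this step, you have already taken {step_count} step(s)."
)


def build_prompt(
    task_description: str,
    step_count: int,
    skill_files: Sequence[str] = (),
) -> str:
    """Fill the template; an empty `skill_files` gives the zero-shot prompt."""
    # Assumption: retained skill files are concatenated to form {skill_context}.
    skill_context = "\n".join(skill_files)
    parts = [HEADER]
    if skill_context:
        parts.append(skill_context)
    parts.append(BODY.format(task_description=task_description, step_count=step_count))
    return "\n".join(parts)


# Early curriculum stage: skill text is present in the context.
print(build_prompt("Who wrote the novel?", 0, ["(hypothetical skill text)"]))
# After the curriculum completes: no skill context at all.
print(build_prompt("Who wrote the novel?", 0))
```

Dropping the skill section entirely when it is empty keeps the zero-shot prompt free of a dangling blank slot.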

Original abstract

Agent skills, structured packages of procedural knowledge and executable resources that agents dynamically load at inference time, have become a reliable mechanism for augmenting LLM agents. Yet inference-time skill augmentation is fundamentally limited: retrieval noise introduces irrelevant guidance, injected skill content imposes substantial token overhead, and the model never truly acquires the knowledge it merely follows. We ask whether skills can instead be internalized into model parameters, enabling zero-shot autonomous behavior without any runtime skill retrieval. We introduce SKILL0, an in-context reinforcement learning framework designed for skill internalization. SKILL0 introduces a training-time curriculum that begins with full skill context and progressively withdraws it. Skills are grouped offline by category and rendered with interaction history into a compact visual context, teaching he model tool invocation and multi-turn task completion. A Dynamic Curriculum then evaluates each skill file's on-policy helpfulness, retaining only those from which the current policy still benefits within a linearly decaying budget, until the agent operates in a fully zero-shot setting. Extensive agentic experiments demonstrate that SKILL0 achieves substantial improvements over the standard RL baseline (+9.7% for ALFWorld, +6.6% for Search-QA, and+10.1% for WebShop), while maintaining a highly efficient context of fewer than 0.5k tokens per step. Our code is available at https://github.com/ZJU-REAL/SkillZero.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SKILL0's dynamic on-policy curriculum for skill internalization is a clean idea with decent benchmark gains, but the abstract leaves the core claim under-supported without ablations on the withdrawal schedule.

The punchline is that SKILL0 shows a workable way to train agents so they stop needing runtime skill retrieval, using a curriculum that starts with full context and pulls it back via on-policy helpfulness filtering until the model runs zero-shot. The reported lifts are +9.7% on ALFWorld, +6.6% on Search-QA, and +10.1% on WebShop, all while keeping context under 0.5k tokens. That efficiency angle is useful for real deployment.

Referee Report

2 major / 2 minor

Summary. The paper introduces SKILL0, an in-context agentic RL framework for internalizing skills into LLM parameters. It starts with full skill context, groups skills by category, renders them with interaction history, and uses a Dynamic Curriculum to evaluate on-policy helpfulness and progressively withdraw context within a linearly decaying budget until zero-shot operation. Experiments claim gains over a standard RL baseline of +9.7% on ALFWorld, +6.6% on Search-QA, and +10.1% on WebShop, with context under 0.5k tokens per step. Code is released.

Significance. If the internalization mechanism is shown to produce genuine parameter-level skill acquisition rather than distribution adaptation, the work could reduce retrieval overhead and token costs in LLM agents while improving autonomy. The open-source code is a clear strength for reproducibility and follow-up verification.

major comments (2)

[Experiments / Dynamic Curriculum description] The central internalization claim rests on the Dynamic Curriculum's on-policy helpfulness metric driving progressive withdrawal. No ablation is reported that compares this dynamic selection against a fixed schedule or random withdrawal (e.g., in the experimental results or §4). Without such a control, it remains unclear whether reported gains reflect parameter encoding of skills or merely curriculum-induced shifts in the training distribution.
[Abstract and results tables] The abstract and results report concrete percentage improvements but provide no variance estimates, statistical significance tests, number of runs, or detailed baseline configurations. This information is required to evaluate whether the gains support the claim of reliable skill internalization over the RL baseline.

minor comments (2)

[Abstract] Abstract contains a typo: 'teaching he model' should read 'teaching the model'.
[Abstract] Abstract formatting: missing space between 'and' and '+10.1%' in 'and+10.1% for WebShop'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address the two major comments point by point below, committing to revisions that strengthen the evidence for the Dynamic Curriculum's role and improve the statistical reporting of results.

Referee: [Experiments / Dynamic Curriculum description] The central internalization claim rests on the Dynamic Curriculum's on-policy helpfulness metric driving progressive withdrawal. No ablation is reported that compares this dynamic selection against a fixed schedule or random withdrawal (e.g., in the experimental results or §4). Without such a control, it remains unclear whether reported gains reflect parameter encoding of skills or merely curriculum-induced shifts in the training distribution.

Authors: We agree that an ablation isolating the on-policy helpfulness metric from simpler withdrawal strategies would strengthen the internalization claim. The current results show SKILL0 outperforming the RL baseline, but without the requested controls it is difficult to fully attribute gains to parameter encoding versus training distribution effects. In the revised manuscript we will add this ablation to §4, comparing dynamic selection against both a fixed linear schedule and random withdrawal under the same token budget, and report the resulting zero-shot performance differences on ALFWorld, Search-QA, and WebShop. revision: yes
Referee: [Abstract and results tables] The abstract and results report concrete percentage improvements but provide no variance estimates, statistical significance tests, number of runs, or detailed baseline configurations. This information is required to evaluate whether the gains support the claim of reliable skill internalization over the RL baseline.

Authors: We accept that variance, run counts, and statistical tests are necessary to substantiate the reliability of the reported gains. The manuscript currently omits these details. In the revision we will update the abstract and all results tables to report means and standard deviations across five independent random seeds, include paired t-test p-values against the RL baseline, and expand the experimental setup section with precise baseline hyper-parameters and training configurations. revision: yes
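
The reporting promised in this response (per-seed mean and standard deviation, plus a paired comparison against the baseline) could be computed along the following lines. This is a sketch using only the Python standard library; the function names are hypothetical and nothing here comes from the paper's code. It returns the paired t statistic only: turning it into a p-value needs the Student's t distribution with n-1 degrees of freedom, which the standard library does not provide (`scipy.stats.ttest_rel` is one option if SciPy is available).

```python
import math
import statistics
from typing import Sequence, Tuple


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(method: Sequence[float], baseline: Sequence[float]) -> float:
    """t statistic of a paired t-test on per-seed score differences."""
    if len(method) != len(baseline) or len(method) < 2:
        raise ValueError("need two equally long sequences with at least 2 seeds")
    # Pair scores by seed so seed-level variation cancels out.
    diffs = [m - b for m, b in zip(method, baseline)]
    spread = statistics.stdev(diffs)
    if spread == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    # t = mean difference divided by its standard error.
    return statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))
```

Pairing by seed matters here: an unpaired test would treat seed-to-seed variance as noise and could hide a consistent per-seed improvement.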

Circularity Check

0 steps flagged

No significant circularity; empirical results are externally measured

The paper describes SKILL0 as an in-context RL framework whose core mechanism is a Dynamic Curriculum that starts with full skill context and progressively withdraws it based on on-policy helpfulness until zero-shot operation. Reported gains (+9.7% ALFWorld, +6.6% Search-QA, +10.1% WebShop) are presented as measured experimental outcomes against a standard RL baseline, not as quantities defined by construction from fitted parameters or self-referential equations. No load-bearing derivation step reduces to self-definition, fitted-input renaming, or a self-citation chain; the curriculum is a procedural training schedule whose final performance is evaluated externally. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The framework rests on standard RL assumptions plus the effectiveness of the proposed curriculum mechanisms; no new physical entities are introduced.

free parameters (1)

linear decay budget schedule
The rate and shape of the skill retention budget decay is a design choice that must be selected to achieve the reported zero-shot behavior.

axioms (2)

domain assumption: Skills grouped offline by category can be rendered into compact visual context that supports learning.
Invoked when describing how interaction history and skills are combined for training.
domain assumption: On-policy helpfulness evaluation reliably identifies skills worth retaining.
Central to the Dynamic Curriculum step that decides retention within the decaying budget.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "Dynamic Curriculum evaluates each skill file’s on-policy helpfulness by comparing agent performance with and without it... until the agent operates in a fully zero-shot setting."

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: "linear decay of the skill budget M(s) ... M(s) = ceil(N * (NS - s) / (NS - 1))"
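
The budget formula quoted above can be evaluated directly. The sketch below assumes, since the passage does not spell it out, that `N` is the number of skill files, `NS` is the number of curriculum stages, and `s` runs from 1 to `NS`; under that reading the budget starts at `N` and reaches 0 at the final stage, which matches the described move from full skill context to zero-shot operation. The function name `skill_budget` is hypothetical.

```python
def skill_budget(stage: int, num_skills: int, num_stages: int) -> int:
    """M(s) = ceil(N * (NS - s) / (NS - 1)), assuming stages s = 1..NS."""
    if num_stages < 2:
        raise ValueError("num_stages must be at least 2")
    if not 1 <= stage <= num_stages:
        raise ValueError("stage must be in 1..num_stages")
    numerator = num_skills * (num_stages - stage)
    # Integer ceiling division avoids floating-point rounding.
    return -(-numerator // (num_stages - 1))


# Example with made-up sizes: 10 skill files over 5 stages.
print([skill_budget(s, 10, 5) for s in range(1, 6)])  # [10, 8, 5, 3, 0]
```

Because of the ceiling, the budget never drops below the exact linear value, so a skill is withdrawn slightly later rather than earlier when the division is not exact.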

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 14 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

From Context to Skills: Can Language Models Learn from Context Skillfully?
cs.AI 2026-04 unverdicted novelty 8.0

Ctx2Skill lets language models autonomously evolve context-specific skills via multi-agent self-play, improving performance on context learning tasks without human supervision.

Test-Time Learning with an Evolving Library
cs.LG 2026-05 unverdicted novelty 7.0

EvoLib enables LLMs to accumulate, reuse, and evolve knowledge abstractions from inference trajectories at test time, yielding substantial gains on math reasoning, code generation, and agentic benchmarks without param...

Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

SLIM dynamically optimizes active external skills in agentic RL via leave-one-skill-out marginal contribution estimates and three lifecycle operations, outperforming baselines by 7.1% on ALFWorld and SearchQA while sh...

Evidence Over Plans: Online Trajectory Verification for Skill Distillation
cs.AI 2026-05 unverdicted novelty 6.0

PDI-guided distillation from environment-verified trajectories yields skills that surpass no-skill baselines and human-written skills across 86 tasks with far lower inference cost.

Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 6.0

Skill1 trains one policy to jointly evolve skill query generation, re-ranking, task solving, and distillation from a single task-success signal, with low-frequency trends crediting selection and high-frequency variati...

Hypothesis generation and updating in large language models
cs.LG 2026-05 unverdicted novelty 6.0

LLMs exhibit Bayesian-like hypothesis updating with strong-sampling bias and an evaluation-generation gap but generalize poorly outside observed data.

SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution
cs.CL 2026-05 unverdicted novelty 5.0

SkillsVote is a governance system for agent skills that profiles corpora, recommends via search, and gates updates on successful reusable outcomes, yielding benchmark gains without model changes.

Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 5.0

SLIM dynamically optimizes the active external skill set in agentic RL via leave-one-skill-out marginal contribution estimates and lifecycle operations, delivering a 7.1% average gain over baselines on ALFWorld and Se...

Ace-Skill: Bootstrapping Multimodal Agents with Prioritized and Clustered Evolution
cs.AI 2026-05 unverdicted novelty 5.0

Ace-Skill boosts multimodal agent self-evolution via prioritized rollouts with lazy-decay tracking and semantic knowledge clustering, yielding up to 35% relative gains on tool-use benchmarks and zero-shot transfer to ...

Learning CLI Agents with Structured Action Credit under Selective Observation
cs.AI 2026-05 unverdicted novelty 5.0

CLI agents trained with RL benefit from selective observation via σ-Reveal and structured credit assignment via A³ that leverages AST action sub-chains and trajectory margins.

A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
cs.IR 2026-05 unverdicted novelty 5.0

A survey that taxonomizes agent skills for LLM-based agents across representation, acquisition, retrieval, and evolution stages while reviewing methods, resources, and open challenges.

Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 5.0

Skill1 co-evolves skill selection, utilization, and distillation inside a single policy using only task-outcome reward, with low-frequency trends crediting selection and high-frequency variation crediting distillation...

Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 5.0

Skill1 trains a single RL policy to co-evolve skill selection, utilization, and distillation in language model agents from one task-outcome reward, using low-frequency trends to credit selection and high-frequency var...

A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
cs.IR 2026-05 unverdicted novelty 4.0

The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.
