
arxiv: 2602.10085 · v3 · pith:RUA3FFJKnew · submitted 2026-02-10 · 💻 cs.AI

CODE-SHARP: Continuous Open-ended Discovery and Evolution of Skills as Hierarchical Reward Programs

Pith reviewed 2026-05-22 10:31 UTC · model grok-4.3

classification 💻 cs.AI
keywords skill discovery · hierarchical rewards · foundation models · reinforcement learning · autonomous learning · open-ended evolution · long-horizon tasks

The pith

CODE-SHARP lets foundation models generate hierarchical Python reward programs so agents can discover and master skills from scratch without any human-designed rewards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CODE-SHARP as a way to let foundation models continuously create and refine an archive of Python programs called SHARPs. Each SHARP defines a local success condition for a skill and lists prerequisites that point to earlier SHARPs, so that at training time the system routes the agent through a chain of rewards and the agent only has to learn the new piece of behavior. This produces fully autonomous reinforcement learning in environments such as Craftax and XLand, where the resulting agents reach much higher performance than earlier methods and are the first to complete advanced actions like mining diamonds. The approach matters because it removes the need for hand-crafted reward functions or task curricula when building agents that keep expanding their own capabilities.

Core claim

CODE-SHARP leverages FMs to open-endedly grow and evolve an archive of Python programs encoding skills to train a generalist agent policy entirely from scratch via reinforcement learning, directly from source code. These programs, termed Skills as Hierarchical Reward Programs (SHARPs), each encode a local success condition and a set of prerequisites delegated to previously discovered SHARPs. At runtime, SHARPs dynamically route the agent through their prerequisite chain based on the current state, rewarding each completion along the way, requiring the agent to learn only the marginal behaviour each new SHARP introduces, enabling efficient learning of long-horizon skills without any pre-defined rewards.

What carries the argument

SHARPs, Python programs that pair a local success condition with a prerequisite list pointing to earlier programs, which at runtime dynamically route the agent and deliver incremental rewards so only the newest skill segment must be learned.
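The routing idea can be made concrete with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the `Sharp` class, the `route` and `step_reward` functions, the dictionary-based state, and the two example skills are all hypothetical names chosen here, and the paper's own programs are described as JAX classes rather than plain Python.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]  # hypothetical flat state, e.g. inventory counts


@dataclass
class Sharp:
    """Hypothetical SHARP: a local success condition plus ordered prerequisites."""
    name: str
    success: Callable[[State], bool]
    # Each entry pairs an environment condition with the earlier SHARP that
    # should be pursued while that condition is still unmet.
    prerequisites: List[Tuple[Callable[[State], bool], "Sharp"]] = field(default_factory=list)

    def route(self, state: State) -> "Sharp":
        """Return the SHARP the agent should currently be working on."""
        for condition, prereq in self.prerequisites:
            if not condition(state):
                # Delegate recursively: the prerequisite may have unmet prerequisites too.
                return prereq.route(state)
        return self  # all conditions hold, so only the new behaviour remains


def step_reward(target: Sharp, state: State, next_state: State) -> float:
    """Reward 1.0 when the currently active SHARP in the chain becomes satisfied."""
    active = target.route(state)
    newly_done = (not active.success(state)) and active.success(next_state)
    return 1.0 if newly_done else 0.0


# Illustrative two-skill archive (names and thresholds are assumptions).
collect_wood = Sharp("CollectWood", lambda s: s["wood"] >= 1)
craft_pickaxe = Sharp(
    "CraftWoodPickaxe",
    lambda s: s["pickaxe"] >= 1,
    prerequisites=[(lambda s: s["wood"] >= 1, collect_wood)],
)
```

With an empty inventory, `craft_pickaxe.route(...)` hands control to `collect_wood`; once wood is held, the same call returns `craft_pickaxe` itself, so the agent is rewarded for each completed link and only the final crafting step is new behaviour.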

If this is right

  • On Craftax-Classic the trained agents reach six times the median performance of prior methods.
  • On XLand the same agents reach 2.6 times the median performance of prior methods.
  • The agents become the only ones able to craft iron tools and mine diamonds in the tested environments.
  • Scaling to Craftax-Extended produces a generalist policy over more than 90 discovered SHARPs that solves long-horizon tasks zero-shot at the level of agents given ground-truth rewards.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same generated program archive could be reused across new environments to avoid re-engineering rewards from scratch.
  • If the number of discovered SHARPs grows with task difficulty, the method could support continual expansion of agent capabilities without external task lists.
  • Physical robots might receive the same hierarchical reward chains to acquire sequences of manipulation skills with minimal human reward design.
  • Periodic pruning of low-utility SHARPs could keep the archive manageable as the number of programs increases.

Load-bearing premise

The foundation model can keep producing valid, non-redundant SHARP programs whose prerequisite chains create useful incremental learning signals without any human curation or filtering.

What would settle it

An experiment in which the foundation model is replaced by one that produces mostly invalid or duplicate SHARPs and the resulting agent shows no improvement over standard RL baselines on diamond-mining success rate in Craftax.

Figures

Figures reproduced from arXiv: 2602.10085 by Antoine Cully, Pierluigi Vito Amadori, Richard Bornemann.

Figure 1: CODE-SHARP consists of two FM-driven iterative processes to discover novel SHARP skills and refine SHARP skills already present in the skill archive. CODE-SHARP utilises a pipeline of FM-based skill proposal generator, implementor, and judge to first generate and filter novel SHARP skills before environment evaluation. Skill refinement is based on the FM-based skill mutation generator and implementor. Skil…
Figure 2: Pseudo-Code version of the SHARP skill defining a skill to craft a stone pickaxe.
Skill Proposal Generator: The proposal generator produces a set of n skill candidates formatted as pseudo-code. Each proposal specifies a high-level description, a binary success condition ϕ, and a dictionary mapping environment conditions to prerequisite SHARP skills in the existing archive. The skill proposal generato…
Figure 3: Interconnected archive of discovered SHARP skills. CODE-SHARP continuously builds on existing SHARP skills in the archive to define novel, meaningful skills in line with the natural curriculum of Craftax. Initial skill discovery focuses on the Overworld before progressing to the Dungeon, then the Mines, and finally the Sewers.
… for a total of 2e9 environment steps. The agent architecture is a JAX (Bradbury et…
Figure 4: (a) Average score achieved on each benchmark task. CODE-SHARP outperforms the zero-shot ReAct LLM agent, the agent pretrained on environment rewards, and the task experts. (b) Evolution of agent capabilities over the course of open-ended skill discovery. The policy planner utilises increasingly complex SHARP skills to define policies-in-code throughout training, resulting in large performance gains relati…
Figure 5 shows the evolution of the absolute score achieved by the goal-conditioned agent guided by the policies-in-code as the skill archive evolves. We observe large increases in performance for the Dungeon and Crafting benchmarks, which are focused on the first two levels of Craftax. Performance on the Navigation and Mines benchmarks, which are focused on the later levels of Craftax, continues to increase steadily…
Figure 6 further illustrates the impact of these components across individual benchmark tasks. We observe that opportunistic sampling is critical for mastering complex, long-horizon tasks. While all ablations contribute to the agent’s success, the data suggest that opportunistic sampling, by dynamically shifting the training distribution toward the frontier of the agent’s capabilities, provides the largest singular…
Figure 7: Evolution of average SHARP skill complexity present in the skill archive.
Abstract

A core quality of general intelligence is the ability to open-endedly expand and evolve its set of mastered skills autonomously. While recent Foundation Model (FM) driven approaches have shown promising results towards this goal, they typically rely on significant human-in-the-loop engineering, limiting their transferability to novel environments. To address this, we introduce Continuous Open-ended Discovery and Evolution of Skills as Hierarchical Reward Programs (CODE-SHARP), a framework that leverages FMs to open-endedly grow and evolve an archive of Python programs encoding skills to train a generalist agent policy entirely from scratch via reinforcement learning, directly from source code. These programs, termed Skills as Hierarchical Reward Programs (SHARPs), each encode a local success condition and a set of prerequisites delegated to previously discovered SHARPs. At runtime, SHARPs dynamically route the agent through their prerequisite chain based on the current state, rewarding each completion along the way, requiring the agent to learn only the marginal behaviour each new SHARP introduces, enabling efficient learning of long-horizon skills without any pre-defined rewards. On Craftax-Classic and XLand, agents trained fully autonomously by CODE-SHARP outperform previous works by 6x and 2.6x in median performance and are the only agents capable of crafting iron tools and mining diamonds. Scaled to Craftax-Extended, CODE-SHARP trains a generalist agent on over 90 discovered SHARPs, enabling the agent to solve challenging long-horizon tasks zero-shot, matching agents trained on ground-truth rewards.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces CODE-SHARP, a framework that uses foundation models to continuously and autonomously discover and evolve an archive of Skills as Hierarchical Reward Programs (SHARPs). Each SHARP is a Python program encoding a local success condition together with a chain of prerequisite SHARPs; at runtime the agent is routed through the chain and rewarded for each completion along the way, so it only has to learn the marginal behavior introduced by the new program. The method is evaluated on Craftax-Classic, XLand, and Craftax-Extended, where it reports 6× and 2.6× median performance gains over prior work on Craftax-Classic and XLand respectively, the only agents able to craft iron tools and mine diamonds, and zero-shot long-horizon solving with >90 discovered SHARPs that matches ground-truth-reward performance.

Significance. If the autonomy and validity claims are substantiated, the work would constitute a meaningful step toward open-ended, reward-free skill acquisition in reinforcement learning. The hierarchical program representation and dynamic routing mechanism offer a concrete route to scaling generalist agents on long-horizon tasks without hand-crafted reward functions.

major comments (2)
  1. [Abstract] The headline performance claims (6× median on Craftax-Classic, 2.6× on XLand, unique iron-tool and diamond-mining capability) are presented without any report of the number of independent runs, statistical significance tests, variance across seeds, or the precise baseline implementations and hyper-parameters used for comparison. These details are required to assess whether the reported gains are robust.
  2. [Abstract and paragraph on SHARP generation and runtime routing] The central claim that training occurs 'fully autonomously' and 'without any pre-defined rewards' rests on the assumption that the foundation model produces executable, non-redundant SHARPs whose prerequisite chains yield useful incremental signals. The manuscript provides no acceptance rate, cycle-detection procedure, redundancy-pruning method, or verification that invalid programs are never inserted into the archive; without these quantities it is impossible to rule out that performance derives from an implicitly curated subset rather than raw open-ended discovery.
minor comments (1)
  1. [Abstract] The acronym SHARP is used in the abstract before its expansion; a parenthetical definition on first use would improve immediate readability.
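The robustness statistics asked for in major comment 1 could be reported along the following lines. This is a sketch only: the function names are hypothetical, the score lists are placeholder numbers and not results from the paper, and reading the headline gains as a ratio of medians is an assumption, since the page does not say how the 6× and 2.6× figures were computed.

```python
import statistics
from typing import Dict, List


def summarize(scores: List[float]) -> Dict[str, float]:
    """Median and interquartile range of per-seed scores."""
    q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {"median": q2, "iqr": q3 - q1, "n_seeds": len(scores)}


def median_ratio(method: List[float], baseline: List[float]) -> float:
    """Ratio of medians: one possible reading of an 'N times median' gain."""
    return statistics.median(method) / statistics.median(baseline)


# Placeholder per-seed scores for illustration only; NOT taken from the paper.
method_scores = [10.0, 12.0, 14.0, 16.0, 18.0]
baseline_scores = [1.0, 2.0, 3.0, 4.0, 5.0]

method_summary = summarize(method_scores)
gain = median_ratio(method_scores, baseline_scores)
```

Reporting `n_seeds` and the interquartile range next to each median lets a reader judge whether a stated multiple is stable across runs or driven by a few seeds.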

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We provide point-by-point responses to the major comments and indicate the revisions we plan to incorporate in the updated manuscript.

  1. Referee: [Abstract] The headline performance claims (6× median on Craftax-Classic, 2.6× on XLand, unique iron-tool and diamond-mining capability) are presented without any report of the number of independent runs, statistical significance tests, variance across seeds, or the precise baseline implementations and hyper-parameters used for comparison. These details are required to assess whether the reported gains are robust.

    Authors: We agree with this observation. The current abstract highlights key results but omits important statistical details. In the revised version, we will update the abstract to include the number of independent runs (e.g., 5 seeds), report median and interquartile ranges, and mention that statistical significance was assessed using appropriate tests. We will also add a table or section detailing baseline implementations and hyperparameters to ensure reproducibility and robustness assessment.
    revision: yes

  2. Referee: [Abstract and paragraph on SHARP generation and runtime routing] The central claim that training occurs 'fully autonomously' and 'without any pre-defined rewards' rests on the assumption that the foundation model produces executable, non-redundant SHARPs whose prerequisite chains yield useful incremental signals. The manuscript provides no acceptance rate, cycle-detection procedure, redundancy-pruning method, or verification that invalid programs are never inserted into the archive; without these quantities it is impossible to rule out that performance derives from an implicitly curated subset rather than raw open-ended discovery.

    Authors: We appreciate the referee highlighting the need for more details on the autonomy mechanisms. While the framework operates without human intervention after initialization, we recognize that explicit descriptions of filtering processes are necessary. In the revision, we will add a new subsection under Methods describing the SHARP validation pipeline, including acceptance rates observed during experiments, cycle detection via topological sorting on the prerequisite graph, redundancy pruning based on program equivalence checks, and runtime verification that only valid, executable programs are added to the archive. This will substantiate that the performance gains stem from the open-ended discovery process.
    revision: yes
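The cycle detection mentioned in the second response can be sketched with Kahn's algorithm for topological sorting. This is a generic illustration, not the validation pipeline the authors promise to describe: the `topological_order` function, the name-to-prerequisites dictionary, and the example archive are assumptions made here.

```python
from collections import deque
from typing import Dict, List, Optional


def topological_order(prereqs: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return skills ordered prerequisites-first, or None if the graph is invalid.

    `prereqs` maps each skill name to the names of the skills it depends on.
    The graph is invalid if a prerequisite is missing from the archive or if
    the prerequisite relation contains a cycle.
    """
    if any(p not in prereqs for deps in prereqs.values() for p in deps):
        return None  # dangling prerequisite
    indegree = {skill: len(set(deps)) for skill, deps in prereqs.items()}
    dependents: Dict[str, List[str]] = {skill: [] for skill in prereqs}
    for skill, deps in prereqs.items():
        for p in set(deps):
            dependents[p].append(skill)
    ready = deque(s for s, d in indegree.items() if d == 0)
    order: List[str] = []
    while ready:
        skill = ready.popleft()
        order.append(skill)
        for child in dependents[skill]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    # If some skill was never processed, the graph contains a cycle.
    return order if len(order) == len(prereqs) else None


# Hypothetical archive: each skill lists the earlier skills it builds on.
archive = {
    "CollectWood": [],
    "CraftWoodPickaxe": ["CollectWood"],
    "MineStone": ["CraftWoodPickaxe"],
}
order = topological_order(archive)
```

A candidate SHARP would be rejected whenever adding it to the dictionary makes `topological_order` return `None`, which covers both circular prerequisite chains and references to skills that are not in the archive.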

Circularity Check

0 steps flagged

No significant circularity: results validated on external benchmarks


The paper's central claims rest on empirical performance metrics obtained from independent game environments (Craftax-Classic, XLand, Craftax-Extended) and direct comparisons to previously published baselines. No equations or derivations reduce the reported performance gains (6x median, diamond-mining capability, zero-shot long-horizon solving) to fitted parameters or self-referential definitions. SHARP generation and routing are described as autonomous processes whose outputs are evaluated externally rather than being tautologically equivalent to the inputs by construction. The method is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the untested assumption that foundation models can produce a growing set of useful, non-circular SHARP programs whose hierarchical structure yields effective incremental rewards; no free parameters or invented physical entities are stated in the abstract.

axioms (1)
  • Domain assumption: Foundation models can generate executable Python programs that correctly encode local success conditions and prerequisite chains for novel skills.
    Invoked throughout the description of autonomous SHARP discovery and runtime routing.
invented entities (1)
  • SHARP (Skill as Hierarchical Reward Program): no independent evidence.
    Purpose: to represent each discovered skill as a Python program that supplies a local reward and delegates prerequisites to earlier programs.
    Core new representation introduced to enable incremental learning without predefined global rewards.

pith-pipeline@v0.9.0 · 5815 in / 1359 out tokens · 44492 ms · 2026-05-22T10:31:49.411745+00:00 · methodology

