CODE-SHARP: Continuous Open-ended Discovery and Evolution of Skills as Hierarchical Reward Programs
Pith reviewed 2026-05-22 10:31 UTC · model grok-4.3
The pith
CODE-SHARP lets foundation models generate hierarchical Python reward programs so agents can discover and master skills from scratch without any human-designed rewards.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CODE-SHARP leverages FMs to open-endedly grow and evolve an archive of Python programs encoding skills to train a generalist agent policy entirely from scratch via reinforcement learning, directly from source code. These programs, termed Skills as Hierarchical Reward Programs (SHARPs), each encode a local success condition and a set of prerequisites delegated to previously discovered SHARPs. At runtime, SHARPs dynamically route the agent through their prerequisite chain based on the current state, rewarding each completion along the way, requiring the agent to learn only the marginal behaviour each new SHARP introduces, enabling efficient learning of long-horizon skills without any pre-defined rewards.
What carries the argument
SHARPs, Python programs that pair a local success condition with a prerequisite list pointing to earlier programs, which at runtime dynamically route the agent and deliver incremental rewards so only the newest skill segment must be learned.
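The routing idea can be made concrete with a small sketch. It is written in plain Python rather than the JAX classes the paper's prompts ask for, and every name in it (`Sharp`, `has`, `route`, `step_reward`, `ARCHIVE`, the three example skills) is a hypothetical stand-in rather than the authors' code. It assumes a state that is a dictionary of inventory counts, a reward of one per completion, and an acyclic prerequisite graph.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]  # hypothetical state: inventory counts keyed by item name
Check = Callable[[State], bool]


def has(item: str, count: int = 1) -> Check:
    """Build a check that passes once the inventory holds `count` of `item`."""
    return lambda state: state.get(item, 0) >= count


@dataclass
class Sharp:
    """Hypothetical SHARP: a local success test plus ordered prerequisites."""
    name: str
    success: Check
    # Each entry pairs a precondition with the name of the earlier SHARP
    # that is expected to make that precondition true.
    prerequisites: List[Tuple[Check, str]] = field(default_factory=list)


def route(archive: Dict[str, Sharp], target: str, state: State) -> str:
    """Return the name of the SHARP the agent should pursue right now."""
    for precondition, prereq_name in archive[target].prerequisites:
        if not precondition(state):
            # Delegate to the earlier SHARP, which may delegate further.
            return route(archive, prereq_name, state)
    return target  # all preconditions hold: only the new behaviour remains


def step_reward(archive: Dict[str, Sharp], active: str, state: State) -> float:
    """Reward of one when the active SHARP's success test holds, else zero."""
    return 1.0 if archive[active].success(state) else 0.0


# Illustrative three-skill archive (skill names are made up for this sketch).
ARCHIVE = {
    "collect_wood": Sharp("collect_wood", has("wood")),
    "make_pickaxe": Sharp("make_pickaxe", has("pickaxe"),
                          [(has("wood"), "collect_wood")]),
    "mine_stone": Sharp("mine_stone", has("stone"),
                        [(has("pickaxe"), "make_pickaxe")]),
}
```

With an empty inventory, `route(ARCHIVE, "mine_stone", {})` resolves to `"collect_wood"`. Once a pickaxe is held it resolves to `"mine_stone"` itself, so the policy only has to learn the final segment.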
If this is right
- On Craftax-Classic the trained agents reach six times the median performance of prior methods.
- On XLand the same agents reach 2.6 times the median performance of prior methods.
- The agents become the only ones able to craft iron tools and mine diamonds in the tested environments.
- Scaling to Craftax-Extended produces a generalist policy over more than 90 discovered SHARPs that solves long-horizon tasks zero-shot at the level of agents given ground-truth rewards.
Where Pith is reading between the lines
- The same generated program archive could be reused across new environments to avoid re-engineering rewards from scratch.
- If the number of discovered SHARPs grows with task difficulty, the method could support continual expansion of agent capabilities without external task lists.
- Physical robots might receive the same hierarchical reward chains to acquire sequences of manipulation skills with minimal human reward design.
- Periodic pruning of low-utility SHARPs could keep the archive manageable as the number of programs increases.
Load-bearing premise
The foundation model can keep producing valid, non-redundant SHARP programs whose prerequisite chains create useful incremental learning signals without any human curation or filtering.
What would settle it
An experiment in which the foundation model is replaced by one that produces mostly invalid or duplicate SHARPs and the resulting agent shows no improvement over standard RL baselines on diamond-mining success rate in Craftax.
Original abstract
A core quality of general intelligence is the ability to open-endedly expand and evolve its set of mastered skills autonomously. While recent Foundation Model (FM) driven approaches have shown promising results towards this goal, they typically rely on significant human-in-the-loop engineering, limiting their transferability to novel environments. To address this, we introduce Continuous Open-ended Discovery and Evolution of Skills as Hierarchical Reward Programs (CODE-SHARP), a framework that leverages FMs to open-endedly grow and evolve an archive of Python programs encoding skills to train a generalist agent policy entirely from scratch via reinforcement learning, directly from source code. These programs, termed Skills as Hierarchical Reward Programs (SHARPs), each encode a local success condition and a set of prerequisites delegated to previously discovered SHARPs. At runtime, SHARPs dynamically route the agent through their prerequisite chain based on the current state, rewarding each completion along the way, requiring the agent to learn only the marginal behaviour each new SHARP introduces, enabling efficient learning of long-horizon skills without any pre-defined rewards. On Craftax-Classic and XLand, agents trained fully autonomously by CODE-SHARP outperform previous works by 6x and 2.6x in median performance and are the only agents capable of crafting iron tools and mining diamonds. Scaled to Craftax-Extended, CODE-SHARP trains a generalist agent on over 90 discovered SHARPs, enabling the agent to solve challenging long-horizon tasks zero-shot, matching agents trained on ground-truth rewards.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CODE-SHARP, a framework that uses foundation models to continuously and autonomously discover and evolve an archive of Skills as Hierarchical Reward Programs (SHARPs). Each SHARP is a Python program encoding a local success condition together with a chain of prerequisite SHARPs; at runtime the agent is routed through the chain and receives incremental rewards only for the marginal behavior introduced by the new program. The method is evaluated on Craftax-Classic, XLand, and Craftax-Extended, where it reports 6× and 2.6× median performance gains over prior work, the first successful iron-tool and diamond-mining agents, and zero-shot long-horizon solving with >90 discovered SHARPs that matches ground-truth-reward performance.
Significance. If the autonomy and validity claims are substantiated, the work would constitute a meaningful step toward open-ended, reward-free skill acquisition in reinforcement learning. The hierarchical program representation and dynamic routing mechanism offer a concrete route to scaling generalist agents on long-horizon tasks without hand-crafted reward functions.
major comments (2)
- [Abstract] The headline performance claims (6× median on Craftax-Classic, 2.6× on XLand, unique iron-tool and diamond-mining capability) are presented without any report of the number of independent runs, statistical significance tests, variance across seeds, or the precise baseline implementations and hyper-parameters used for comparison. These details are required to assess whether the reported gains are robust.
- [Abstract and paragraph on SHARP generation and runtime routing] The central claim that training occurs 'fully autonomously' and 'without any pre-defined rewards' rests on the assumption that the foundation model produces executable, non-redundant SHARPs whose prerequisite chains yield useful incremental signals. The manuscript provides no acceptance rate, cycle-detection procedure, redundancy-pruning method, or verification that invalid programs are never inserted into the archive; without these quantities it is impossible to rule out that performance derives from an implicitly curated subset rather than raw open-ended discovery.
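The per-seed statistics requested in the first comment could be reported along the following lines. This is a minimal sketch, assuming each method's final scores are available as one number per seed. The function names `summarize_runs` and `median_ratio` are hypothetical, and the inputs in the usage note are placeholders, not results from the paper.

```python
import statistics
from typing import Dict, Sequence


def summarize_runs(scores: Sequence[float]) -> Dict[str, float]:
    """Median and interquartile range over per-seed scores (needs >= 2 seeds)."""
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {"n": len(scores), "median": median, "iqr": q3 - q1}


def median_ratio(method_scores: Sequence[float],
                 baseline_scores: Sequence[float]) -> float:
    """Ratio of medians, the quantity behind an 'N times the median' claim.

    Raises ZeroDivisionError if the baseline median is zero.
    """
    return statistics.median(method_scores) / statistics.median(baseline_scores)
```

For the placeholder input `[1, 2, 3, 4, 5]`, `summarize_runs` returns a median of 3.0 and an interquartile range of 2.0. A ratio of medians says nothing about spread by itself, which is why the comment asks for variance across seeds as well.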
minor comments (1)
- [Abstract] The acronym SHARP is used in the abstract before its expansion; a parenthetical definition on first use would improve immediate readability.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We provide point-by-point responses to the major comments and indicate the revisions we plan to incorporate in the updated manuscript.
- Referee: [Abstract] The headline performance claims (6× median on Craftax-Classic, 2.6× on XLand, unique iron-tool and diamond-mining capability) are presented without any report of the number of independent runs, statistical significance tests, variance across seeds, or the precise baseline implementations and hyper-parameters used for comparison. These details are required to assess whether the reported gains are robust.
  Authors: We agree with this observation. The current abstract highlights key results but omits important statistical details. In the revised version, we will update the abstract to include the number of independent runs (e.g., 5 seeds), report median and interquartile ranges, and mention that statistical significance was assessed using appropriate tests. We will also add a table or section detailing baseline implementations and hyperparameters to ensure reproducibility and robustness assessment.
  Revision: yes
- Referee: [Abstract and paragraph on SHARP generation and runtime routing] The central claim that training occurs 'fully autonomously' and 'without any pre-defined rewards' rests on the assumption that the foundation model produces executable, non-redundant SHARPs whose prerequisite chains yield useful incremental signals. The manuscript provides no acceptance rate, cycle-detection procedure, redundancy-pruning method, or verification that invalid programs are never inserted into the archive; without these quantities it is impossible to rule out that performance derives from an implicitly curated subset rather than raw open-ended discovery.
  Authors: We appreciate the referee highlighting the need for more details on the autonomy mechanisms. While the framework operates without human intervention after initialization, we recognize that explicit descriptions of filtering processes are necessary. In the revision, we will add a new subsection under Methods describing the SHARP validation pipeline, including acceptance rates observed during experiments, cycle detection via topological sorting on the prerequisite graph, redundancy pruning based on program equivalence checks, and runtime verification that only valid, executable programs are added to the archive. This will substantiate that the performance gains stem from the open-ended discovery process.
  Revision: yes
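The cycle-detection step named in the second response can be illustrated with Python's standard `graphlib` module. This is a sketch under stated assumptions, not the authors' pipeline: the archive is modelled as a dictionary mapping each skill name to the names of its prerequisites, and `validate_prerequisites` and `can_insert` are hypothetical helper names. Redundancy pruning and executability checks are not covered.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, Iterable, List


def validate_prerequisites(archive: Dict[str, List[str]]) -> List[str]:
    """Return skills ordered prerequisites-first, or raise ValueError.

    `archive` maps a skill name to the names of its prerequisite skills.
    """
    missing = {p for prereqs in archive.values() for p in prereqs
               if p not in archive}
    if missing:
        raise ValueError(f"unknown prerequisites: {sorted(missing)}")
    try:
        # static_order() yields every node after all of its predecessors
        # and raises CycleError if the graph contains a cycle.
        return list(TopologicalSorter(archive).static_order())
    except CycleError as err:
        raise ValueError(f"prerequisite cycle: {err.args[1]}") from err


def can_insert(archive: Dict[str, List[str]], name: str,
               prereqs: Iterable[str]) -> bool:
    """Check a candidate skill against a copy of the archive before adding it."""
    candidate = {**archive, name: list(prereqs)}
    try:
        validate_prerequisites(candidate)
    except ValueError:
        return False
    return True
```

Checking a copy of the archive before insertion keeps a rejected proposal from ever touching the stored graph, which is the property the referee asks the authors to verify.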
Circularity Check
No significant circularity: results validated on external benchmarks
The paper's central claims rest on empirical performance metrics obtained from independent game environments (Craftax-Classic, XLand, Craftax-Extended) and direct comparisons to previously published baselines. No equations or derivations reduce the reported performance gains (6x median, diamond-mining capability, zero-shot long-horizon solving) to fitted parameters or self-referential definitions. SHARP generation and routing are described as autonomous processes whose outputs are evaluated externally rather than being tautologically equivalent to the inputs by construction. The method is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Foundation models can generate executable Python programs that correctly encode local success conditions and prerequisite chains for novel skills.
invented entities (1)
- SHARP (Skill as Hierarchical Reward Program): no independent evidence
Reference graph
Works this paper leans on
- [1] Liang, J., Huang, W., Xia, F., Xu, P., Hausman, K., Ichter, B., Florence, P., and Zeng, A. Code as Policies: Language Model Programs for Embodied Control. arXiv preprint arXiv:2209.07753, 2022.
- [2] Propose a Single Skill: Your proposal must be for exactly one novel and diverse skill that fits within the provided category. Ensure the proposed skill is novel. Simple repetitions of existing skills, e.g. MineThreeWood when MineWood is present, are not acceptable. Novelty implies a functional difference, not just a parametric one.
- [3] Build Upon Existing Skills: The proposed skill must expand the agent’s current repertoire by building on existing skills. Consider the logical order of skill acquisition to maximize the agent’s potential success. The prerequisite skills used to define the proposed skill must be present in the archive.
- [4] Form a Curriculum: Ensure that you start out with simple skills. Continuously build on already learned skills to form a curriculum of increasingly complex skills. The curriculum should be as easy as possible for the agent to follow.
- [5] Skill Exploration: The ultimate goal is for the agent to possess as many diverse and meaningful skills as possible. For this it is crucial to explore as many levels as possible in the environment but also thoroughly explore each level for any meaningful skills proposals not yet present in the agent’s skill repertoire.
- [6] Define Success Condition: Clearly define a condition that indicates the skill has been successfully completed. For crafting tools the success condition must be given as absolute values, i.e. for wood pickaxe it must be cur.inventory.pickaxe >= 1. For navigation skills, ensure the success condition includes the level the agent should be on to fulfill th...
- [7] State Starting Conditions: Carefully review the provided environment code to identify any necessary starting conditions. State these conditions using the provided template. If no specific conditions are needed, you must write "No implementation for a condition needed".
- [8] Condition Order: The agent will complete each condition in the order that is specified in the skill definition. Carefully analyse the environment and the logic of the skill to decide the order in which you place the conditions.
- [9] Link Prerequisites: If a starting condition is required, you must name a prerequisite skill from the agent’s repertoire that can fulfill it. If a necessary prerequisite skill does not exist, you must disregard your initial idea and instead propose the missing prerequisite skill.
- [10] Assign Reward: Assign a reward of one to the skill.
- [11] Proposal History: Carefully go over previously failed skill proposals, analyse why they might not have achieved a high enough performance to be accepted and use them as potential inspiration for new proposals. Never directly repropose one of the failed skills. You can propose a skill with the same objective but it must use a different structure so as to a...
- [12] Validate the Proposal: First, check the proposal for correctness. If it’s fundamentally flawed and cannot be fixed, reject it and explain why. Success Condition: ensure it’s logical and unambiguous; correct minor logical errors; for crafting tools the success condition must be given as absolute values, i.e. for wood pickax...
- [13] Implement the JAX Class: If the proposal is valid (or you’ve corrected it), implement the class. Indexing: use the proposed index; if it’s already taken, increment the index by one until it is unique; refer to prerequisite skills by their correct index. Code Requirements: your entire implementation MUST be JAX and JIT-c...
- [14] **Selected Skill:** At most two skills can be selected. If no skill passes your judgement you are allowed to reject them all.
- [15] **Filtering:** Start by comparing each skill proposal against the example skills. Filter out all skills which would not be classified as novel. Then, make your final decision based on the set of novel skills.
- [16] **Justification:** Provide a concise explanation for your choice, explicitly referencing how the selected skill excels in **Curriculum Coherence**, **Strategic Value**, and/or **Skill Diversity** compared to the other candidates.
- [17] **Output:** Provide the class of the skill which you choose as the optimal next skill to add to the agent’s repertoire in the format provided to you.
- [18] **Criteria:** The most important criterion your selected skill must possess is feasibility. Always pick a skill that presents a logical incremental improvement over extremely difficult skills for which the agent does not possess all prerequisite skills. ENVIRONMENT CODE: $environment_description$ SKILL FUNCTION TEMPLATE: ...
- [19] Ensure your mutations follow the exact heuristic specification given to you and are sensible.
- [20] If you add a new precondition function, ensure that a relevant skill to satisfy it is present in the agent’s skill archive.
- [21] Directly follow the output template given to you to define your mutation proposal.
- [22] All precondition functions should be clearly marked in the mutation preconditions.
- [23] Carefully analyse the previous failed mutation attempts, if available, to intelligently propose a next mutation. You should not directly reimplement one of the previous failed mutations.
- [24] Under no circumstances should you mutate the success condition of the parent skill. ENVIRONMENT DESCRIPTION $environment_description SKILL ARCHIVE $skill_repertoire PARENT SKILL $sampled_parent_skill $sampled_parent_skill_code PREVIOUSLY FAILED PROPOSED ...
- [25] Skill Structure: **Class Name:** ‘BenchmarkSolver’; **Index:** $next_skill_index; **Template:** Use the provided JAX class structure.
- [26] Logic & Strategy Requirements: You must define the ‘cond_fns’ (preconditions) and ‘prereq_fns’ (actions) lists. Construct them using the following logic: A. Strategic Planning: Before strictly following the provided milestones, you must analyze the task requirements to ensure agent survival and efficiency. **Preparati...
- [27] Code Requirements: **JAX Compatibility:** The implementation must be pure JAX and JIT-compatible. **Operators:** ALWAYS use ‘jnp.logical_and’, ‘jnp.logical_or’, etc., instead of Python native operators. **Imports:** Do not add import statements; assume the environment is pre-loaded. CRAFTAX E...