
arxiv: 2604.15097 · v2 · pith:3UMECFYBnew · submitted 2026-04-16 · 💻 cs.SE · cs.CL

From Procedural Skills to Strategy Genes: Towards Experience-Driven Test-Time Evolution

Pith reviewed 2026-05-10 10:50 UTC · model grok-4.3

classification 💻 cs.SE cs.CL
keywords experience representation, test-time evolution, code solving, AI agents, iterative improvement, gene representation, skill fragments

The pith

A compact Gene representation for reusable experience outperforms documentation-heavy Skill packages in guiding AI code solvers and enabling iterative evolution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks how experience should be encoded so that it can serve as reliable test-time control and as material for ongoing improvement in AI systems. Across thousands of controlled trials on scientific code problems, documentation-style Skill packages prove unstable because their signal is sparse and expanding them often degrades results. A compact Gene format instead delivers the best average performance, holds up under changes to its structure, and supports stronger gains when failure history is attached. This matters because the bottleneck is not the volume of experience but the encoding that lets systems actually use and evolve it.

Core claim

Representation itself is a first-order factor. A compact Gene representation yields the strongest overall average, remains competitive under substantial structural perturbations, and outperforms matched-budget Skill fragments, while reattaching documentation-oriented material usually weakens rather than improves it. Gene is also a better carrier for iterative experience accumulation: attached failure history is more effective in Gene than in Skill or freeform text, editable structure matters beyond content alone, and failure information is most useful when distilled into compact warnings rather than naively appended. On CritPt, gene-evolved systems improve over their paired base models from 9.1% to 18.57% and from 17.7% to 27.14%.

What carries the argument

The Gene, a compact editable structure that functions simultaneously as test-time control signal and substrate for iterative accumulation of experience.
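
To make the idea concrete, here is a minimal Python sketch of what a compact, editable experience object could look like. The field names (`goal`, `steps`, `warnings`), the `render` format, and the example content are illustrative assumptions; this page does not specify the paper's actual Gene schema.

```python
from dataclasses import dataclass, field


@dataclass
class Gene:
    """Hypothetical compact experience unit; all fields are assumptions."""
    goal: str
    steps: list[str] = field(default_factory=list)
    warnings: list[str] = field(default_factory=list)

    def add_warning(self, text: str) -> None:
        # Edit the structure in place, skipping exact duplicates.
        if text not in self.warnings:
            self.warnings.append(text)

    def render(self) -> str:
        # Serialize to a short control block that can be placed in a prompt.
        lines = [f"GOAL: {self.goal}"]
        lines += [f"STEP {i}: {s}" for i, s in enumerate(self.steps, 1)]
        lines += [f"AVOID: {w}" for w in self.warnings]
        return "\n".join(lines)


# Made-up example content, for illustration only.
gene = Gene(
    goal="Integrate a stiff ODE system",
    steps=["Use an implicit solver", "Check conservation after each run"],
)
gene.add_warning("Explicit Euler diverged at the default step size")
print(gene.render())
```

The point of the sketch is only that one small object can be both rendered as a control signal and edited between attempts; how the paper structures its Genes may differ.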

If this is right

  • Gene-evolved systems reach higher success rates on code-solving tasks than their base models.
  • Failure history improves performance more when carried by Gene than by Skill packages or raw text.
  • Distilling failures into compact warnings outperforms simply appending full failure traces.
  • Adding documentation material to a compact Gene usually reduces rather than increases effectiveness.
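
The contrast between appending failure traces and distilling them can be sketched as follows. Both strategies and the last-line `distill` heuristic are hypothetical stand-ins, and the traces are made up; the page does not describe how the paper performs distillation.

```python
def append_traces(experience: str, traces: list[str]) -> str:
    # Naive strategy: attach every raw failure trace verbatim.
    return experience + "\n" + "\n".join(traces)


def distill(trace: str) -> str:
    # Toy heuristic (an assumption): keep only the final line of a trace,
    # which in a Python traceback names the exception.
    return "AVOID: " + trace.strip().splitlines()[-1]


def attach_warnings(experience: str, traces: list[str]) -> str:
    # Compact strategy: one warning per distinct failure, order preserved.
    warnings = list(dict.fromkeys(distill(t) for t in traces))
    return experience + "\n" + "\n".join(warnings)


trace = (
    "Traceback (most recent call last):\n"
    '  File "solve.py", line 12\n'
    "ZeroDivisionError: division by zero"
)
traces = [trace, trace]  # the same failure observed twice
base = "GOAL: normalize the spectrum"
naive = append_traces(base, traces)
compact = attach_warnings(base, traces)
print(len(naive), len(compact))
```

This only shows that the distilled form is shorter and deduplicated. Whether it also improves solve rates is the paper's empirical claim, not something the sketch demonstrates.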

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same representation choice may matter for experience reuse in non-code domains such as planning or tool use.
  • Agents could maintain and evolve populations of Genes across repeated interactions rather than accumulating procedural fragments.
  • Benchmarks that vary only representation format while holding total token budget fixed would isolate the effect more cleanly.

Load-bearing premise

That the 45 scientific code-solving scenarios and the chosen definitions of Gene versus Skill fragments are representative of broader experience-reuse needs, and that observed differences arise primarily from representation format.

What would settle it

A replication on a substantially larger or more diverse set of tasks in which Skill formats consistently produce higher success rates than Gene formats would falsify the central claim.

Figures

Figures reproduced from arXiv: 2604.15097 by Haoyang Zhang, Junjie Wang, Yiming Ren.

Figure 1: Experience units for test-time control and their representative performance. (a) Skill as …
Figure 2: Scenario-level checkpoint distribution of the benchmark used in this paper. The benchmark …
Figure 3: Skill’s control value is sparse, whereas Gene remains stronger even under matched budget. (a) Decomposing Skill shows that only a narrow procedural slice is clearly useful, while several sections are neutral or harmful. (b) Under an approximately matched budget, shortened Skill fragments improve substantially, yet still remain below Gene.
Figure 4: Gene is substantially more sensitive to content corruption than to structural distortion. Wrong algorithm and wrong domain reduce performance on both Pro and Flash, whereas inverted priority remains competitive and overconstrained guidance even improves over clean Gene in this setting. This suggests that Gene’s effect is not tied to one fixed surface form, but depends more strongly on whether the encoded e…
Figure 5: Accuracy (%) on the CritPt benchmark. Two gene-evolved systems, …
Original abstract

This beta technical report asks how reusable experience should be represented so that it can function as effective test-time control and as a substrate for iterative evolution. We study this question in 4.590 controlled trials across 45 scientific code-solving scenarios. We find that documentation-oriented Skill packages provide unstable control: their useful signal is sparse, and expanding a compact experience object into a fuller documentation package often fails to help and can degrade the overall average. We further show that representation itself is a first-order factor. A compact Gene representation yields the strongest overall average, remains competitive under substantial structural perturbations, and outperforms matched-budget Skill fragments, while reattaching documentation-oriented material usually weakens rather than improves it. Beyond one-shot control, we show that Gene is also a better carrier for iterative experience accumulation: attached failure history is more effective in Gene than in Skill or freeform text, editable structure matters beyond content alone, and failure information is most useful when distilled into compact warnings rather than naively appended. On CritPt, gene-evolved systems improve over their paired base models from 9.1% to 18.57% and from 17.7% to 27.14%. These results suggest that the core problem in experience reuse is not how to supply more experience, but how to encode experience as a compact, control-oriented, evolution-ready object.
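
The figures quoted in the abstract can be unpacked with a few lines of arithmetic. The helper below is a hypothetical convenience function, not something from the paper; it separates absolute gains (percentage points) from relative gains (percent of the base score), and also divides the trial count by the scenario count.

```python
def gains(base: float, evolved: float) -> tuple[float, float]:
    # Returns (absolute gain in percentage points, relative gain in percent).
    absolute = evolved - base
    relative = 100.0 * absolute / base
    return round(absolute, 2), round(relative, 1)


first = gains(9.1, 18.57)    # first base model / gene-evolved pair on CritPt
second = gains(17.7, 27.14)  # second pair
trials_per_scenario = 4590 / 45

print(first)                 # (9.47, 104.1)
print(second)                # (9.44, 53.3)
print(trials_per_scenario)   # 102.0
```

Both pairs gain roughly 9.4 to 9.5 percentage points, which roughly doubles the weaker base model but is closer to a 53% relative lift for the stronger one. The quotient of 102 trials per scenario is only an average; the abstract does not say how trials are distributed across conditions and models.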

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper reports results from 4,590 controlled trials across 45 scientific code-solving scenarios. It claims that documentation-oriented Skill packages yield unstable control, while a compact Strategy Gene representation delivers the strongest overall average performance, remains competitive under structural perturbations, outperforms matched-budget Skill fragments, and serves as a superior carrier for iterative experience accumulation (e.g., failure history is more effective when distilled into compact warnings). Reattaching documentation-oriented material typically weakens results. Concrete gains are reported on CritPt, where gene-evolved systems improve paired base models from 9.1% to 18.57% and from 17.7% to 27.14%. The work concludes that the core issue in experience reuse is encoding as compact, control-oriented, evolution-ready objects rather than supplying more experience.

Significance. If the results prove robust, the work provides large-scale empirical evidence that representation format is a first-order factor in test-time experience reuse for LLM-based code solvers. It shifts emphasis from volume of experience to structured, editable, compact encodings that support both immediate control and iterative evolution. The scale of the evaluation (4,590 trials) and the concrete percentage gains on held-out scenarios are notable strengths that could inform agent design in software engineering and related domains.

major comments (1)
  1. [Abstract / Experimental Setup] The central claim that performance gains arise from the Gene representation (rather than incidental prompt differences) is load-bearing, yet the abstract only states that Gene 'outperforms matched-budget Skill fragments' without detailing how total prompt length, token allocation, structural framing, delimiters, ordering, and auxiliary instructions are held identical across conditions. If Skill fragments are rendered as fuller documentation-style text while Gene remains compact, the reported lifts (e.g., CritPt from 9.1% to 18.57%) could be artifacts of prompt engineering. Explicit confirmation or ablation of these controls is required in the methods section.
minor comments (1)
  1. [Abstract] The figure '4.590' in the abstract is a typographical error and should read '4,590' for readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful and constructive review. The concern about prompt equivalence controls is well-taken and directly relevant to the load-bearing claim. We address it below and will strengthen the manuscript with additional documentation.

Point-by-point responses
  1. Referee: [Abstract / Experimental Setup] The central claim that performance gains arise from the Gene representation (rather than incidental prompt differences) is load-bearing, yet the abstract only states that Gene 'outperforms matched-budget Skill fragments' without detailing how total prompt length, token allocation, structural framing, delimiters, ordering, and auxiliary instructions are held identical across conditions. If Skill fragments are rendered as fuller documentation-style text while Gene remains compact, the reported lifts (e.g., CritPt from 9.1% to 18.57%) could be artifacts of prompt engineering. Explicit confirmation or ablation of these controls is required in the methods section.

    Authors: We agree that the abstract is concise and that the methods section should provide explicit confirmation. In the full manuscript the matched-budget condition is implemented by selecting or truncating Skill fragments so that their token count (measured with the same tokenizer) lies within 5% of the Gene length for each scenario; the base prompt template, system instructions, query framing, delimiters, and output constraints are identical across all conditions, with the experience block inserted at the same position. We will add a dedicated subsection in Methods titled 'Prompt Equivalence and Token Budget Controls' that reports the per-scenario token budgets, the fragment-selection procedure, and a new ablation in which we deliberately mismatch lengths to isolate the effect of representation format. This revision will make the controls fully transparent and rule out prompt-engineering artifacts.
    Revision: yes
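
The matched-budget control described in the rebuttal can be sketched roughly as below. This is an illustration under stated assumptions, not the authors' procedure: whitespace splitting stands in for the unnamed tokenizer, fragments are taken greedily in their given order, and `match_budget` and `within_tolerance` are hypothetical helper names.

```python
def count_tokens(text: str) -> int:
    # Stand-in tokenizer (assumption): whitespace split. The rebuttal says
    # only that the same tokenizer is applied to both formats.
    return len(text.split())


def match_budget(fragments: list[str], gene: str) -> str:
    """Take Skill fragments in order, truncating the last one, so the result
    never exceeds the Gene's token count. Original whitespace is not kept."""
    budget = count_tokens(gene)
    taken: list[str] = []
    for frag in fragments:
        remaining = budget - len(taken)
        if remaining <= 0:
            break
        taken.extend(frag.split()[:remaining])
    return " ".join(taken)


def within_tolerance(skill: str, gene: str, tol: float = 0.05) -> bool:
    # True when the Skill text is within tol (5% by default) of the Gene length.
    target = count_tokens(gene)
    return abs(count_tokens(skill) - target) <= tol * target


# Synthetic placeholder text: a 20-token Gene and two 15-token fragments.
gene = " ".join(f"g{i}" for i in range(20))
fragments = [
    " ".join(f"a{i}" for i in range(15)),
    " ".join(f"b{i}" for i in range(15)),
]
matched = match_budget(fragments, gene)
print(count_tokens(matched), within_tolerance(matched, gene))
```

A check like `within_tolerance` is worth running per scenario, since a scenario whose Skill fragments are shorter than the Gene cannot be brought into range by truncation alone.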

Circularity Check

0 steps flagged

No circularity: purely empirical comparison of representations via measured outcomes

Full rationale

The paper conducts 4,590 controlled trials on 45 held-out scientific code-solving scenarios and directly measures performance differences between Gene and Skill representations (e.g., CritPt lifts from 9.1% to 18.57%). No equations, derivations, fitted parameters, or self-citations are invoked to reduce any claimed result to its own inputs by construction. All reported gains are external measurements on independent test cases rather than tautological re-expressions of fitted quantities or prior self-referential theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that the chosen 45 scenarios adequately sample the space of scientific code problems and that the Gene representation can be consistently instantiated across models without additional hidden parameters.

axioms (1)
  • domain assumption: The 45 code-solving scenarios are representative of the broader class of tasks where experience reuse matters.
    Invoked when generalizing from the reported averages to the claim that Gene is the better carrier for experience.
invented entities (1)
  • Strategy Gene (no independent evidence)
    purpose: Compact, editable carrier for experience that supports both one-shot control and iterative evolution.
    New object introduced to contrast with Skill packages; no independent falsifiable prediction outside the reported trials is supplied.



Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems

    cs.AI · 2026-05 · unverdicted · novelty 7.0

    MOSS performs source-level self-rewriting in agent systems and raised OpenClaw four-task mean score from 0.25 to 0.61 in one cycle.

  2. Library Drift: Diagnosing and Fixing a Silent Failure Mode in Self-Evolving LLM Skill Libraries

    cs.AI · 2026-05 · unverdicted · novelty 7.0

    Identifies library drift as a failure mode in self-evolving LLM skill libraries and shows a governance recipe improves pass@1 from 0.258 to 0.584 on MBPP+ hard-100.

  3. Ratchet: A Minimal Hygiene Recipe for Self-Evolving LLM Agents

    cs.AI · 2026-05 · conditional · novelty 6.0

    Ratchet provides a minimal hygiene recipe for self-managing skill libraries in frozen LLM agents, delivering +0.328 rolling-mean pass@1 gain on MBPP+ hard-100 and +0.22 peak lift on SWE-bench Verified.

  4. SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution

    cs.CL · 2026-05 · unverdicted · novelty 5.0

    SkillsVote is a governance system for agent skills that profiles corpora, recommends via search, and gates updates on successful reusable outcomes, yielding benchmark gains without model changes.