From Procedural Skills to Strategy Genes: Towards Experience-Driven Test-Time Evolution

Haoyang Zhang; Junjie Wang; Yiming Ren

arxiv: 2604.15097 · v2 · pith:3UMECFYBnew · submitted 2026-04-16 · 💻 cs.SE · cs.CL

Pith reviewed 2026-05-10 10:50 UTC · model grok-4.3

keywords: experience representation · test-time evolution · code solving · AI agents · iterative improvement · gene representation · skill fragments

The pith

A compact Gene representation for reusable experience outperforms documentation-heavy Skill packages in guiding AI code solvers and enabling iterative evolution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks how experience should be encoded so that it can serve as reliable test-time control and as material for ongoing improvement in AI systems. Across thousands of controlled trials on scientific code problems, documentation-style Skill packages prove unstable because their signal is sparse and expanding them often degrades results. A compact Gene format instead delivers the best average performance, holds up under changes to its structure, and supports stronger gains when failure history is attached. This matters because the bottleneck is not the volume of experience but the encoding that lets systems actually use and evolve it.

Core claim

Representation itself is a first-order factor. A compact Gene representation yields the strongest overall average, remains competitive under substantial structural perturbations, and outperforms matched-budget Skill fragments, while reattaching documentation-oriented material usually weakens rather than improves it. Gene is also a better carrier for iterative experience accumulation: attached failure history is more effective in Gene than in Skill or freeform text, editable structure matters beyond content alone, and failure information is most useful when distilled into compact warnings rather than naively appended. On CritPt, gene-evolved systems improve over their paired base models from

What carries the argument

The Gene, a compact editable structure that functions simultaneously as test-time control signal and substrate for iterative accumulation of experience.
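
As a rough illustration of what a "compact editable structure" could look like in practice, the sketch below models a Gene as a small Python dataclass that can be rendered into a prompt block and edited by attaching distilled failure warnings. The field names (`goal`, `steps`, `warnings`), the `max_warnings` cap, and the rendering format are all assumptions made for illustration; this summary does not specify the paper's actual Gene schema.

```python
from dataclasses import dataclass, field


@dataclass
class Gene:
    """Hypothetical compact experience object (the schema is an assumption)."""
    goal: str
    steps: list[str]
    warnings: list[str] = field(default_factory=list)
    max_warnings: int = 3  # assumed cap that keeps the object compact

    def add_warning(self, warning: str) -> None:
        """Attach a distilled failure warning; keep only the newest few."""
        if warning not in self.warnings:
            self.warnings.append(warning)
        self.warnings = self.warnings[-self.max_warnings:]

    def render(self) -> str:
        """Serialize the Gene into a short block for insertion into a prompt."""
        lines = [f"GOAL: {self.goal}"]
        lines += [f"STEP {i}: {s}" for i, s in enumerate(self.steps, start=1)]
        lines += [f"AVOID: {w}" for w in self.warnings]
        return "\n".join(lines)


# Illustrative content only; not taken from the paper's benchmark.
gene = Gene(
    goal="integrate a stiff ODE system",
    steps=["use an implicit solver", "check tolerances"],
)
gene.add_warning("explicit Euler diverged at large step sizes")
print(gene.render())
```

The point of the sketch is the design choice the paper argues for: failure history enters as a short warning inside an editable structure, instead of being appended as a raw trace.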

If this is right

Gene-evolved systems reach higher success rates on code-solving tasks than their base models.
Failure history improves performance more when carried by Gene than by Skill packages or raw text.
Distilling failures into compact warnings outperforms simply appending full failure traces.
Adding documentation material to a compact Gene usually reduces rather than increases effectiveness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same representation choice may matter for experience reuse in non-code domains such as planning or tool use.
Agents could maintain and evolve populations of Genes across repeated interactions rather than accumulating procedural fragments.
Benchmarks that vary only representation format while holding total token budget fixed would isolate the effect more cleanly.

Load-bearing premise

That the 45 scientific code-solving scenarios and the chosen definitions of Gene versus Skill fragments are representative of broader experience-reuse needs, and that observed differences arise primarily from representation format.

What would settle it

A replication on a substantially larger or more diverse set of tasks in which Skill formats consistently produce higher success rates than Gene formats would falsify the central claim.
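
One simple way to operationalize this check is to count the share of scenarios in which a Skill format beats the Gene format. The sketch below assumes per-scenario success rates are available as dictionaries keyed by scenario name; the function name `skill_win_fraction` and all numbers are made-up placeholders, not results from the paper.

```python
def skill_win_fraction(gene_rates, skill_rates):
    """Fraction of scenarios where Skill strictly beats Gene."""
    if not gene_rates:
        raise ValueError("need at least one scenario")
    if set(gene_rates) != set(skill_rates):
        raise ValueError("both formats must be evaluated on the same scenarios")
    wins = sum(1 for name in gene_rates if skill_rates[name] > gene_rates[name])
    return wins / len(gene_rates)


# Placeholder success rates for four hypothetical scenarios.
gene_rates = {"s1": 0.60, "s2": 0.40, "s3": 0.75, "s4": 0.50}
skill_rates = {"s1": 0.55, "s2": 0.45, "s3": 0.70, "s4": 0.50}

print(skill_win_fraction(gene_rates, skill_rates))
```

A fraction consistently near 1 on a larger and more diverse task set would be the kind of evidence described above; a real replication would also need repeated trials and a significance test, which this sketch omits.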

Figures

Figures reproduced from arXiv: 2604.15097 by Haoyang Zhang, Junjie Wang, Yiming Ren.

**Figure 2.** Scenario-level checkpoint distribution of the benchmark used in this paper. The benchmark …

**Figure 3.** Skill’s control value is sparse, whereas Gene remains stronger even under matched budget. (a) Decomposing Skill shows that only a narrow procedural slice is clearly useful, while several sections are neutral or harmful. (b) Under an approximately matched budget, shortened Skill fragments improve substantially, yet still remain below Gene. … as well as Skill-QuickRef, Skill-ErrorHandling, and Skill-Pitfalls. …

**Figure 4.** Gene is substantially more sensitive to content corruption than to structural distortion. Wrong algorithm and wrong domain reduce performance on both Pro and Flash, whereas inverted priority remains competitive and overconstrained guidance even improves over clean Gene in this setting. This suggests that Gene’s effect is not tied to one fixed surface form, but depends more strongly on whether the encoded e…

**Figure 5.** Accuracy (%) on the CritPt benchmark. Two gene-evolved systems, …

Original abstract

This beta technical report asks how reusable experience should be represented so that it can function as effective test-time control and as a substrate for iterative evolution. We study this question in 4.590 controlled trials across 45 scientific code-solving scenarios. We find that documentation-oriented Skill packages provide unstable control: their useful signal is sparse, and expanding a compact experience object into a fuller documentation package often fails to help and can degrade the overall average. We further show that representation itself is a first-order factor. A compact Gene representation yields the strongest overall average, remains competitive under substantial structural perturbations, and outperforms matched-budget Skill fragments, while reattaching documentation-oriented material usually weakens rather than improves it. Beyond one-shot control, we show that Gene is also a better carrier for iterative experience accumulation: attached failure history is more effective in Gene than in Skill or freeform text, editable structure matters beyond content alone, and failure information is most useful when distilled into compact warnings rather than naively appended. On CritPt, gene-evolved systems improve over their paired base models from 9.1% to 18.57% and from 17.7% to 27.14%. These results suggest that the core problem in experience reuse is not how to supply more experience, but how to encode experience as a compact, control-oriented, evolution-ready object.
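
The headline numbers can be unpacked with a little arithmetic. The sketch below computes the absolute gain (percentage points) and the relative ratio for the two CritPt pairs quoted in the abstract, plus the average number of trials per scenario. The helper name `gain` is illustrative, and the trial count is read as 4,590 (the abstract prints "4.590").

```python
def gain(base_pct, evolved_pct):
    """Return (absolute gain in percentage points, evolved/base ratio)."""
    absolute = round(evolved_pct - base_pct, 2)
    relative = round(evolved_pct / base_pct, 2)
    return absolute, relative


# (base accuracy %, gene-evolved accuracy %) as quoted in the abstract.
pairs = [(9.1, 18.57), (17.7, 27.14)]
for base, evolved in pairs:
    points, ratio = gain(base, evolved)
    print(f"{base}% -> {evolved}%: +{points} points, {ratio}x")

# Average only; the abstract does not say trials were spread evenly.
trials_per_scenario = 4590 / 45
print(trials_per_scenario)
```

Both pairs gain roughly 9.4 to 9.5 percentage points, so the absolute lift is similar even though the relative improvement is larger for the weaker base model.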

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper finds compact Gene representations beat matched-budget Skill fragments for test-time control and iterative evolution in code agents, but prompt construction details remain a possible confound.

The main point here is that compact Strategy Gene encodings of experience deliver stronger one-shot control and better iterative gains than Skill fragments or documentation packages in these agent setups. Across 4,590 trials on 45 scientific code scenarios, the Gene format shows lifts like the reported jumps on CritPt, and it handles attached failure history more effectively when distilled into warnings rather than raw text or expanded docs. Adding more documentation often hurts rather than helps, which is a practical takeaway for anyone managing experience reuse at test time. The scale of the controlled trials and the focus on representation as a first-order factor are the clearest contributions. The work also demonstrates that editable structure in the Gene carrier matters beyond just the content.

That said, the central comparison rests on the claim that conditions were matched beyond token budget. The abstract states matched-budget Skill fragments, yet without the exact prompt templates, delimiters, ordering, or auxiliary instructions, it is still possible that structural differences in how experience is injected explain part of the edge. The scenarios are all scientific code-solving, so the results may not transfer directly to other agent domains.

This is the sort of empirical report that people working on agent memory and test-time adaptation will want to see. It deserves a serious referee because the trial volume is decent and the question is concrete, even though the methods section will need more detail on prompt equivalence and statistical controls before the numbers can be taken at face value.

Referee Report

1 major / 1 minor

Summary. The paper reports results from 4,590 controlled trials across 45 scientific code-solving scenarios. It claims that documentation-oriented Skill packages yield unstable control, while a compact Strategy Gene representation delivers the strongest overall average performance, remains competitive under structural perturbations, outperforms matched-budget Skill fragments, and serves as a superior carrier for iterative experience accumulation (e.g., failure history is more effective when distilled into compact warnings). Reattaching documentation-oriented material typically weakens results. Concrete gains are reported on CritPt, where gene-evolved systems improve paired base models from 9.1% to 18.57% and from 17.7% to 27.14%. The work concludes that the core issue in experience reuse is encoding as compact, control-oriented, evolution-ready objects rather than supplying more experience.

Significance. If the results prove robust, the work provides large-scale empirical evidence that representation format is a first-order factor in test-time experience reuse for LLM-based code solvers. It shifts emphasis from volume of experience to structured, editable, compact encodings that support both immediate control and iterative evolution. The scale of the evaluation (4,590 trials) and the concrete percentage gains on held-out scenarios are notable strengths that could inform agent design in software engineering and related domains.

major comments (1)

[Abstract / Experimental Setup] The central claim that performance gains arise from the Gene representation (rather than incidental prompt differences) is load-bearing, yet the abstract only states that Gene 'outperforms matched-budget Skill fragments' without detailing how total prompt length, token allocation, structural framing, delimiters, ordering, and auxiliary instructions are held identical across conditions. If Skill fragments are rendered as fuller documentation-style text while Gene remains compact, the reported lifts (e.g., CritPt from 9.1% to 18.57%) could be artifacts of prompt engineering. Explicit confirmation or ablation of these controls is required in the methods section.

minor comments (1)

[Abstract] The figure '4.590' in the abstract is a typographical error and should read '4,590' for readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful and constructive review. The concern about prompt equivalence controls is well-taken and directly relevant to the load-bearing claim. We address it below and will strengthen the manuscript with additional documentation.

Referee: [Abstract / Experimental Setup] The central claim that performance gains arise from the Gene representation (rather than incidental prompt differences) is load-bearing, yet the abstract only states that Gene 'outperforms matched-budget Skill fragments' without detailing how total prompt length, token allocation, structural framing, delimiters, ordering, and auxiliary instructions are held identical across conditions. If Skill fragments are rendered as fuller documentation-style text while Gene remains compact, the reported lifts (e.g., CritPt from 9.1% to 18.57%) could be artifacts of prompt engineering. Explicit confirmation or ablation of these controls is required in the methods section.

Authors: We agree that the abstract is concise and that the methods section should provide explicit confirmation. In the full manuscript the matched-budget condition is implemented by selecting or truncating Skill fragments so that their token count (measured with the same tokenizer) lies within 5% of the Gene length for each scenario; the base prompt template, system instructions, query framing, delimiters, and output constraints are identical across all conditions, with the experience block inserted at the same position. We will add a dedicated subsection in Methods titled 'Prompt Equivalence and Token Budget Controls' that reports the per-scenario token budgets, the fragment-selection procedure, and a new ablation in which we deliberately mismatch lengths to isolate the effect of representation format. This revision will make the controls fully transparent and rule out prompt-engineering artifacts.

Revision: yes
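
The matched-budget rule described in the rebuttal (Skill token count within 5% of the Gene length) can be sketched as follows. This is a minimal illustration, not the authors' procedure: `count_tokens` uses a whitespace split as a stand-in for whatever tokenizer the paper uses, and the function names are hypothetical.

```python
def count_tokens(text):
    """Stand-in tokenizer: whitespace split (an assumption, not the paper's)."""
    return len(text.split())


def truncate_to_budget(skill_text, gene_text, tolerance=0.05):
    """Trim a Skill fragment to at most (1 + tolerance) x the Gene length."""
    budget = int(count_tokens(gene_text) * (1 + tolerance))
    return " ".join(skill_text.split()[:budget])


def within_tolerance(skill_text, gene_text, tolerance=0.05):
    """Check that the Skill length is within +/- tolerance of the Gene length."""
    gene_len = count_tokens(gene_text)
    return abs(count_tokens(skill_text) - gene_len) <= tolerance * gene_len


# Synthetic texts: a 20-token Gene and a 100-token Skill fragment.
gene_text = " ".join(["g"] * 20)
skill_text = " ".join(["s"] * 100)

matched = truncate_to_budget(skill_text, gene_text)
print(count_tokens(matched), within_tolerance(matched, gene_text))
```

Truncation only handles fragments that are too long; a fragment shorter than 95% of the Gene length would need the "selecting" half of the procedure, which is not modeled here.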

Circularity Check

0 steps flagged

No circularity: purely empirical comparison of representations via measured outcomes

The paper conducts 4,590 controlled trials on 45 held-out scientific code-solving scenarios and directly measures performance differences between Gene and Skill representations (e.g., CritPt lifts from 9.1% to 18.57%). No equations, derivations, fitted parameters, or self-citations are invoked to reduce any claimed result to its own inputs by construction. All reported gains are external measurements on independent test cases rather than tautological re-expressions of fitted quantities or prior self-referential theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that the chosen 45 scenarios adequately sample the space of scientific code problems and that the Gene representation can be consistently instantiated across models without additional hidden parameters.

axioms (1)

Domain assumption: The 45 code-solving scenarios are representative of the broader class of tasks where experience reuse matters.
Invoked when generalizing from the reported averages to the claim that Gene is the better carrier for experience.

invented entities (1)

Strategy Gene (no independent evidence)
purpose: Compact, editable carrier for experience that supports both one-shot control and iterative evolution.
New object introduced to contrast with Skill packages; no independent falsifiable prediction outside the reported trials is supplied.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems
cs.AI 2026-05 unverdicted novelty 7.0

MOSS performs source-level self-rewriting in agent systems and raised OpenClaw four-task mean score from 0.25 to 0.61 in one cycle.

Library Drift: Diagnosing and Fixing a Silent Failure Mode in Self-Evolving LLM Skill Libraries
cs.AI 2026-05 unverdicted novelty 7.0

Identifies library drift as a failure mode in self-evolving LLM skill libraries and shows a governance recipe improves pass@1 from 0.258 to 0.584 on MBPP+ hard-100.

Ratchet: A Minimal Hygiene Recipe for Self-Evolving LLM Agents
cs.AI 2026-05 conditional novelty 6.0

Ratchet provides a minimal hygiene recipe for self-managing skill libraries in frozen LLM agents, delivering +0.328 rolling-mean pass@1 gain on MBPP+ hard-100 and +0.22 peak lift on SWE-bench Verified.

SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution
cs.CL 2026-05 unverdicted novelty 5.0

SkillsVote is a governance system for agent skills that profiles corpora, recommends via search, and gates updates on successful reusable outcomes, yielding benchmark gains without model changes.