When Search Becomes Memory: Turning Robot Design Trials into Transferable Skills

Xiaohao Xu; Xiaonan Huang; Yang Li; Yunfei Wang

arXiv: 2605.25832 · v1 · pith:KXWGLEVOnew · submitted 2026-05-25 · 💻 cs.RO · cs.AI · cs.CL · cs.CV

Pith reviewed 2026-06-29 21:22 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CL · cs.CV

keywords robot morphology design · LLM agent · skill library · evolutionary search · transfer learning · EvoGym · natural language rules

The pith

Auto-Robotist turns robot design search results into an explicit natural-language skill library that improves initial performance and transfers to larger spaces better than genetic algorithms alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Auto-Robotist, a self-evolving LLM agent that distills the outcomes of morphology searches into a reusable skill library containing structural archetypes, positive and negative rules, and supporting designs. This library makes previously implicit search memory inspectable and retrievable, allowing the agent to condition LLM-generated edits on past results while still running a genetic algorithm path for exploration. After each round of evaluations, the library is updated through Add, Diagnose, and Merge steps. Experiments across seven EvoGym tasks show gains in cold-start 5x5 searches and successful transfer of the learned skills to 10x10 design spaces, where the reference-conditioned version beats GA on every task. The work therefore converts one-off simulator runs into persistent, auditable design principles.

Core claim

By distilling morphology-search traces into an explicit natural-language skill library that stores structural archetypes, evidence-grounded positive and negative rules, and the evaluated designs that support them, Auto-Robotist allows an LLM agent to retrieve skills that condition LLM edits of elite bodies while retaining a Genetic Algorithm mutation path; after evaluation the library is updated through Add, Diagnose, and Merge operations, producing better cold-start performance on 5x5 spaces and reference-conditioned transfer that outperforms GA on every 10x10 task tested.
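The hybrid proposal step in this claim can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `propose`, `ga_mutate`, and `is_valid`, the mixing probability `p_llm`, the integer voxel encoding, and the validity check are all hypothetical, and `llm_edit` stands in for whatever LLM call the agent actually makes.

```python
import random
from typing import Callable, List, Sequence

Grid = List[List[int]]  # assumed encoding: 0 = empty voxel, 1..4 = material types


def ga_mutate(parent: Grid, rng: random.Random, n_types: int = 5, rate: float = 0.1) -> Grid:
    """GA path: resample each voxel independently with probability `rate`."""
    return [[rng.randrange(n_types) if rng.random() < rate else v for v in row]
            for row in parent]


def is_valid(child: Grid, parent: Grid) -> bool:
    """Hypothetical check: same shape as the parent and at least one non-empty voxel."""
    same_shape = len(child) == len(parent) and all(
        len(c) == len(p) for c, p in zip(child, parent))
    return same_shape and any(v != 0 for row in child for v in row)


def propose(parent: Grid,
            skills: Sequence[str],
            llm_edit: Callable[[Grid, Sequence[str]], Grid],
            rng: random.Random,
            p_llm: float = 0.5) -> Grid:
    """Return one child body.

    With probability `p_llm` (and only if skills were retrieved), ask the LLM
    for a skill-conditioned edit; fall back to GA mutation when the LLM path
    is not taken or its output fails the validity check.
    """
    if skills and rng.random() < p_llm:
        child = llm_edit(parent, skills)
        if is_valid(child, parent):
            return child
    return ga_mutate(parent, rng)
```

Keeping the GA fallback inside the same function means an empty library (the cold-start case) degrades to plain mutation. A fuller loop would presumably also validate GA children before spending simulator time on them.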

What carries the argument

The skill library: each entry stores a structural archetype, evidence-grounded positive and negative rules, and the evaluated designs that support them, so that retrieved entries can condition LLM proposals during search.
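One way to picture a library entry is as a small record type. The sketch below is an assumption about layout only: the class and field names (`Skill`, `Rule`, `EvaluatedDesign`, `evidence_ids`) are hypothetical and the paper's actual schema is not shown in the material reviewed here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EvaluatedDesign:
    design_id: str
    grid: List[List[int]]  # voxel grid; integer encoding is an assumption
    fitness: float         # simulator score for this body


@dataclass
class Rule:
    text: str                      # natural-language statement of the rule
    polarity: str                  # "positive" or "negative"
    evidence_ids: List[str] = field(default_factory=list)  # designs backing the rule


@dataclass
class Skill:
    archetype: str                                   # structural archetype description
    rules: List[Rule] = field(default_factory=list)
    designs: List[EvaluatedDesign] = field(default_factory=list)

    def rules_by_polarity(self, polarity: str) -> List[Rule]:
        """Return only the rules with the requested polarity."""
        return [r for r in self.rules if r.polarity == polarity]

    def best_design(self) -> Optional[EvaluatedDesign]:
        """Highest-fitness supporting design, or None if the skill has none yet."""
        return max(self.designs, key=lambda d: d.fitness) if self.designs else None
```

Linking each rule to design identifiers is what would make the memory auditable: a reader can trace a rule back to the evaluations that motivated it.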

If this is right

Improves cold-start performance on 5x5 searches across locomotion, traversal, and object-interaction tasks.
Enables learned skills to transfer to 10x10 spaces where reference-conditioned search beats GA on every task.
Makes design memory explicit and inspectable rather than implicit inside a population.
Supports ongoing library updates through Add, Diagnose, and Merge after each evaluation round.
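The material reviewed here names the Add, Diagnose, and Merge operations but does not spell out their decision logic. The sketch below is therefore one hypothetical reading, with deterministic stand-ins where the real agent presumably calls an LLM: the function names, the dictionary layout, and the "child beat parent" criterion are all assumptions.

```python
from typing import Dict, List, Tuple

# Assumed layout: archetype -> {"positive": [...], "negative": [...], "designs": [...]}
Library = Dict[str, dict]


def add(library: Library, archetype: str, design_id: str, fitness: float) -> None:
    """Add: record an evaluated design, creating the skill entry if it is new."""
    skill = library.setdefault(
        archetype, {"positive": [], "negative": [], "designs": []})
    skill["designs"].append((design_id, fitness))


def diagnose(library: Library, archetype: str, edit_note: str,
             parent_fitness: float, child_fitness: float) -> str:
    """Diagnose: file the edit as a positive rule if the child improved on its
    parent, otherwise as a negative rule. Returns the polarity used."""
    polarity = "positive" if child_fitness > parent_fitness else "negative"
    rules: List[str] = library[archetype][polarity]
    if edit_note not in rules:  # avoid storing the same rule twice
        rules.append(edit_note)
    return polarity


def merge(library: Library, keep: str, absorb: str) -> None:
    """Merge: fold the `absorb` skill into `keep`, deduplicating rules."""
    src = library.pop(absorb)
    dst = library[keep]
    for polarity in ("positive", "negative"):
        for rule in src[polarity]:
            if rule not in dst[polarity]:
                dst[polarity].append(rule)
    dst["designs"].extend(src["designs"])
```

In this reading, Add grows the library, Diagnose attaches evidence-grounded rules, and Merge keeps the library from fragmenting into near-duplicate skills.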

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same library structure could be used to share design knowledge across entirely different robot tasks or simulators without retraining from scratch.
If the verbalized skills prove stable, the method could lower the total number of expensive evaluations needed by reusing past results across multiple design problems.
The approach might extend to other evolutionary search domains where trial outcomes can be turned into verbal rules, such as molecule or material design.
The inspectable library opens the possibility of human review or editing of the stored principles before they are reused.

Load-bearing premise

The assumption that LLM-generated natural-language skills accurately capture generalizable design principles without distortion or loss of critical information when retrieved and applied to new design spaces.

What would settle it

A replication of the 10x10 transfer experiments in which reference-conditioned transfer fails to outperform plain GA on at least one of the seven tasks would falsify the every-task claim.
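For a sense of how much a clean sweep of seven tasks shows by itself, a one-sided exact sign test is a quick check. This is an illustrative calculation, not an analysis from the paper: it assumes each task is an independent paired comparison with no ties, and the function name `sign_test_p` is hypothetical.

```python
from math import comb


def sign_test_p(wins: int, n: int) -> float:
    """One-sided exact sign test: P(X >= wins) for X ~ Binomial(n, 0.5)."""
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


print(sign_test_p(7, 7))  # 0.0078125, i.e. 1/128
print(sign_test_p(6, 7))  # 0.0625, i.e. 8/128
```

Under those assumptions, winning all seven tasks has a chance probability of 1/128, while a single loss raises it to 8/128. The test ignores effect sizes and run-to-run variance, so it complements per-task statistics rather than replacing them.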

Figures

Figures reproduced from arXiv: 2605.25832 by Xiaohao Xu, Xiaonan Huang, Yang Li, Yunfei Wang.

**Figure 1.** Search traces as reusable design knowledge. (A) Traditional search-based design is memoryless: many 5×5 morphologies are evaluated, but only the elite artifact is retained, leaving the evidence behind success and failure unused. (B) Auto-Robotist reflects over successful elites and rollout evidence to build a simulator-grounded natural-language skill library. Retrieved skills guide proposals for a new …

**Figure 2.** Overview of Auto-Robotist. (A) During design evolution, Auto-Robotist augments genetic mutation with skill-conditioned LLM proposals. Retrieved skills guide edits to elite parents, valid candidates are evaluated in EvoGym with PPO-trained controllers, and the resulting fitness updates the elite pool. (B) During skill evolution, evaluation evidence is written back into a persistent library through ADD, DIAG…

**Figure 3.** Design evolution trace of Auto-Robotist on Carrier. Each point marks the best body after an evidence-guided library update, annotated with the skill or rule that motivated the next design edit. Auto-Robotist transforms a weak random seed into a load-bearing carrier by completing the active base, adding central webbing, forming a vertical spine, and densifying bridge support. The bottom rollouts show that t…

**Figure 4.** Best fitness vs. morphology evaluations. Top: 5×5 cold-start search, where Auto-Robotist learns from empty memory. Bottom: 5×5 → 10×10 transfer, where the learned library is reused in a larger space. The green curve removes the source-body exemplar, isolating skill transfer from visual imitation. Legend entries recovered from the plot: Auto-Robotist and Genetic Algorithm (GA) curves for Pusher, Jumper, Walker, …

**Figure 5.** Representative best 10×10 transfer rollouts. Auto-Robotist more often preserves support and contact through time, yielding faster movement, higher jumps, or farther object displacement than GA.

**Figure 6.** Elite robot designs across 5×5 and 10×10. Direct upsampling preserves geometry but often loses function; skill-guided search reconfigures source motifs into viable target-scale morphologies.

**Figure 7.** A learned Walker skill library. Evaluations are distilled into L1 archetypes, L2 positive/negative rules, and L3 supporting observations, exposing both reusable structures and grounded failure modes.

**Figure 8.** Best fitness on Pusher per ablation condition.

Original abstract

Large language models (LLMs) are increasingly used as proposal generators for evolutionary robot design, yet most loops remain memoryless: simulator results shape the next population but are not preserved as reusable design knowledge. We present Auto-Robotist, a self-evolving LLM agent that distills morphology-search traces into an explicit natural-language skill library. Each skill stores a structural archetype, evidence-grounded positive and negative rules, and the evaluated designs that support them, making design memory inspectable rather than implicit in a population. During search, the agent retrieves skills to condition LLM edits of elite bodies while retaining a Genetic Algorithm (GA) mutation path for exploration; after evaluation, it updates the library through Add, Diagnose, and Merge. Across seven EvoGym tasks spanning locomotion, traversal, and object interaction, Auto-Robotist improves cold-start 5x5 search and transfers learned skills to 10x10 design spaces, where reference-conditioned transfer outperforms GA on every task. These results suggest that LLM agents can convert expensive physical evaluations into reusable, auditable design principles. Our code will be released upon acceptance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper turns morphology search traces into an explicit, updatable natural-language skill library and tests its transfer from 5x5 to 10x10 EvoGym spaces.

The main contribution is distilling search results into a skill library that holds structural archetypes, positive and negative rules, and supporting designs, then maintaining it with Add, Diagnose, and Merge steps. During new searches the library conditions LLM edits of elite bodies while a GA path stays available for exploration. This setup is tested on seven EvoGym tasks for both cold-start improvement and transfer to bigger design spaces, with the claim that reference-conditioned transfer beats plain GA on every task.

The framing makes the accumulated knowledge inspectable instead of implicit in a population, and the hybrid retrieval-plus-GA path is a practical way to keep exploration while reusing past work. The transfer experiments directly address whether the distilled rules carry value beyond the original search scale.

The abstract supplies no numbers, error bars, ablation results, or description of the retrieval mechanism, so the size of the reported gains and how much the library actually drives them are impossible to judge yet. The assumption that LLM-generated natural-language rules preserve generalizable design principles without important loss or distortion is central and untested in the visible material. Full results would need to show concrete use of the rules and rule out that other factors explain the outperformance.

This is for researchers working on LLM agents in evolutionary design or anyone trying to make repeated morphology searches less wasteful. A reader focused on cumulative optimization would find the update operations and transfer test worth examining even if the gains turn out modest. The work deserves peer review because the mechanism is distinct from standard memoryless loops and the experimental setup uses established benchmarks, though revisions will almost certainly be required for the quantitative details and implementation specifics.

Referee Report

2 major / 2 minor

Summary. The paper introduces Auto-Robotist, a self-evolving LLM agent for evolutionary robot morphology design in EvoGym. It distills search traces into an explicit natural-language skill library (storing archetypes, positive/negative rules, and supporting designs), retrieves skills to condition LLM edits of elite bodies during search (while retaining a GA mutation path), and updates the library via Add/Diagnose/Merge operations. Empirical claims include improved cold-start performance on 5x5 design spaces and successful transfer to 10x10 spaces, where reference-conditioned transfer outperforms GA across all seven tasks spanning locomotion, traversal, and object interaction.

Significance. If the transfer results hold with rigorous quantification, the work would be significant for demonstrating how LLM agents can convert expensive simulator evaluations into inspectable, reusable design principles rather than leaving knowledge implicit in populations. This addresses a gap in memoryless evolutionary design loops and could enable more efficient scaling to larger design spaces.

major comments (2)

[Abstract] The central empirical claims of 'improves cold-start 5x5 search' and 'outperforms GA on every task' in 10x10 transfer are stated without any quantitative metrics, error bars, statistical tests, or ablation details on skill retrieval/application. This is load-bearing for the main result and prevents assessment of effect size or robustness.
[Method (skill library construction and retrieval)] The assumption that LLM-generated natural-language skills capture generalizable design principles without distortion is central to the transfer claim but receives no direct validation (e.g., via human review of skill fidelity or controlled ablations removing the skill library).

minor comments (2)

[Method] The description of the Add/Diagnose/Merge update rules would benefit from pseudocode or a flowchart to clarify the exact decision logic and prevent ambiguity in reproduction.
[Method] Clarify the exact mechanism for 'reference-conditioned transfer' (e.g., how retrieved skills are injected into the LLM prompt) to distinguish it from standard few-shot prompting.
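To make the second minor comment concrete, here is one hypothetical way retrieved skill content could be injected into an edit prompt. It is a sketch only: the function name `build_edit_prompt`, the prompt wording, and the grid serialization are assumptions, and the paper's actual prompt templates are not reproduced here.

```python
from typing import List, Sequence


def build_edit_prompt(parent: List[List[int]],
                      archetype: str,
                      positive: Sequence[str],
                      negative: Sequence[str]) -> str:
    """Assemble a text prompt that conditions an LLM edit on one retrieved skill."""
    # Serialize the parent body as whitespace-separated rows (assumed format).
    rows = "\n".join(" ".join(str(v) for v in row) for row in parent)
    lines = [
        "Parent body (one row per line):",
        rows,
        f"Target archetype: {archetype}",
    ]
    if positive:
        lines.append("Prefer these sub-patterns when they fit the parent:")
        lines += [f"- {rule}" for rule in positive]
    if negative:
        lines.append("Avoid these patterns:")
        lines += [f"- {rule}" for rule in negative]
    lines.append("Return the edited body in the same grid format.")
    return "\n".join(lines)
```

A skill with no rules yet still yields a usable prompt, because the archetype line alone gives the edit a direction. What would separate this from standard few-shot prompting is where the rules come from: they are distilled from evaluated designs rather than hand-written exemplars.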

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point-by-point below, agreeing where the manuscript is currently deficient and outlining concrete revisions.

Referee: [Abstract] The central empirical claims of 'improves cold-start 5x5 search' and 'outperforms GA on every task' in 10x10 transfer are stated without any quantitative metrics, error bars, statistical tests, or ablation details on skill retrieval/application. This is load-bearing for the main result and prevents assessment of effect size or robustness.

Authors: We agree that the abstract presents the claims qualitatively and omits the supporting numbers, error bars, and statistical details needed to evaluate effect size. In the revised version we will rewrite the abstract to report concrete metrics (e.g., mean fitness improvement percentages with standard deviations across runs, and p-values from paired statistical tests versus GA) together with a brief statement of the key ablation outcomes on skill retrieval.

Revision: yes

Referee: [Method (skill library construction and retrieval)] The assumption that LLM-generated natural-language skills capture generalizable design principles without distortion is central to the transfer claim but receives no direct validation (e.g., via human review of skill fidelity or controlled ablations removing the skill library).

Authors: The observation is correct: the manuscript provides no direct validation of skill fidelity and relies solely on downstream transfer performance. We will add (1) a controlled ablation that disables skill retrieval while keeping all other components fixed and reports the resulting performance drop, and (2) a qualitative appendix that presents representative skills alongside the search-trace evidence that generated them, allowing readers to judge fidelity. A full-scale human review of every skill is not feasible within the current experimental budget, but the added ablation and example analysis will supply the requested direct evidence.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper describes an empirical LLM-based agent (Auto-Robotist) that maintains an explicit skill library distilled from morphology search traces, then uses retrieval to condition edits during search on EvoGym tasks. All reported gains are framed as direct experimental comparisons against GA baselines in 5x5 and 10x10 design spaces. No equations, parameter fits, predictions, or uniqueness theorems appear; the method is self-contained as a procedural pipeline whose validity rests on observable transfer performance rather than any derivation that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The approach rests on the unverified premise that LLMs can reliably extract and apply design rules from simulation traces; no free parameters or invented physical entities are mentioned.

axioms (1)

Domain assumption: LLMs can distill simulation outcomes into accurate, generalizable natural-language design rules without significant hallucination or information loss. This is required for the skill library to function as claimed but is not demonstrated in the abstract.

invented entities (1)

Skill library (no independent evidence)
Purpose: store structural archetypes, evidence-based rules, and supporting designs as reusable memory.
A new construct introduced by the paper to make design knowledge explicit and transferable.

pith-pipeline@v0.9.1-grok · 5735 in / 1271 out tokens · 44521 ms · 2026-06-29T21:22:30.248332+00:00 · methodology
