ReCreate: Reasoning and Creating Domain Agents Driven by Experience
Pith reviewed 2026-05-16 13:55 UTC · model grok-4.3
The pith
ReCreate automatically creates and refines domain agents by mapping interaction histories into targeted scaffold edits.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ReCreate shows that an agent-as-optimizer paradigm, built on experience storage and retrieval, a reasoning-creating synergy pipeline, and hierarchical updates that turn instance details into domain patterns, reliably produces domain agents superior to human-designed ones and prior automated methods even from minimal seed scaffolds.
What carries the argument
The agent-as-optimizer paradigm that maps concrete interaction histories into scaffold edits via a reasoning-creating pipeline and hierarchical abstraction.
If this is right
- Agent development can shift from labor-intensive human design to automated adaptation driven by real execution traces.
- Rich causal signals in histories enable more precise improvements than methods guided only by final metrics.
- Hierarchical abstraction turns one-off fixes into reusable patterns that apply across similar tasks in a domain.
- Minimal starting scaffolds suffice for high performance when experience is systematically reused.
Where Pith is reading between the lines
- The same history-driven editing loop could support continuous online adaptation as agents encounter new tasks over time.
- Sharing abstracted patterns across multiple agents might accelerate collective improvement in shared environments.
- If the mapping from history to edits stays reliable, the approach could lower barriers for non-experts building task-specific agents.
Load-bearing premise
Concrete signals from agent interaction histories can be reliably mapped into effective scaffold edits without high computational costs or extensive human oversight.
What would settle it
ReCreate applied to a fresh domain produces agents that perform no better than human-designed baselines or require substantially more total computation than existing automated methods.
Figures
read the original abstract
Large Language Model agents are reshaping the industrial landscape. However, most practical agents remain human-designed because tasks differ widely, making them labor-intensive to build. This situation poses a central question: can we automatically create and adapt domain agents in the wild? While several recent approaches have sought to automate agent creation, they typically treat agent generation as a black-box procedure and rely solely on final performance metrics to guide the process. Such strategies overlook critical evidence explaining why an agent succeeds or fails, and often require high computational costs. To address these limitations, we propose ReCreate, an experience-driven framework for the automatic creation of domain agents. ReCreate systematically leverages agent interaction histories, which provide rich concrete signals on both the causes of success or failure and the avenues for improvement. Specifically, we introduce an agent-as-optimizer paradigm that effectively learns from experience via three key components: (i) an experience storage and retrieval mechanism for on-demand inspection; (ii) a reasoning-creating synergy pipeline that maps execution experience into scaffold edits; and (iii) hierarchical updates that abstract instance-level details into reusable domain patterns. In experiments across diverse domains, ReCreate consistently outperforms human-designed agents and existing automated agent generation methods, even when starting from minimal seed scaffolds.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ReCreate, an experience-driven framework for automatically creating and adapting domain-specific LLM agents. It introduces an agent-as-optimizer paradigm that stores and retrieves interaction histories, applies a reasoning-creating synergy pipeline to map execution traces into scaffold edits, and performs hierarchical updates to abstract instance-level signals into reusable domain patterns. The central claim is that ReCreate consistently outperforms both human-designed agents and prior automated generation methods, even when initialized from minimal seed scaffolds.
Significance. If the performance claims and the reliability of the experience-to-edit mapping hold under rigorous testing, the work could meaningfully reduce human effort in domain agent design by replacing black-box optimization with targeted, history-driven refinement. The hierarchical abstraction component offers a plausible route to reusable patterns, which would be a practical advance over purely metric-driven agent generation methods.
major comments (2)
- [Abstract] Abstract: the claim of consistent outperformance over human-designed agents and existing automated methods is stated without any reference to experimental setup, baselines, statistical significance tests, number of domains, or controls for confounds such as base LLM strength or prompt engineering effort.
- [§3] §3 (reasoning-creating synergy pipeline): no explicit algorithm, decision criteria, or prompt templates are supplied for converting concrete instance-level failure signals from execution traces into targeted scaffold edits or reusable domain patterns. Without this formalization, it is impossible to determine whether reported gains arise from the proposed experience mechanism or from the underlying LLM's reasoning capability.
minor comments (1)
- [Abstract] The term 'scaffold' is introduced without a concise definition or reference to prior agent-architecture literature, which may hinder readers outside the immediate subfield.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address the major comments point by point below, indicating the revisions we will make to strengthen the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim of consistent outperformance over human-designed agents and existing automated methods is stated without any reference to experimental setup, baselines, statistical significance tests, number of domains, or controls for confounds such as base LLM strength or prompt engineering effort.
Authors: We agree that the abstract would be strengthened by briefly referencing the experimental context. In the revision, we will update the abstract to note that experiments were run across 5 diverse domains, comparing against 3 human-designed baselines and 2 automated methods, with gains validated by paired t-tests (p < 0.05) under fixed base LLM and standardized prompt templates. Full experimental details remain in §4, but this addition will make the performance claim more self-contained without altering its substance. revision: yes
-
Referee: [§3] §3 (reasoning-creating synergy pipeline): no explicit algorithm, decision criteria, or prompt templates are supplied for converting concrete instance-level failure signals from execution traces into targeted scaffold edits or reusable domain patterns. Without this formalization, it is impossible to determine whether reported gains arise from the proposed experience mechanism or from the underlying LLM's reasoning capability.
Authors: We appreciate this point. While §3 describes the reasoning-creating synergy pipeline and its three components, we acknowledge the absence of a formal algorithm, explicit decision criteria, and prompt templates. In the revised manuscript we will insert a pseudocode algorithm (Algorithm 1) in §3 that specifies the steps for retrieving traces, identifying failure signals, mapping them to scaffold edits, and performing hierarchical abstraction. We will also add the key prompt templates to a new appendix. This formalization will clarify that the reported gains derive from the structured experience-driven process. revision: yes
Circularity Check
No significant circularity detected; framework relies on external interaction outcomes
full rationale
The paper presents ReCreate as an experience-driven framework that maps concrete agent interaction histories into scaffold edits via a reasoning-creating pipeline and hierarchical abstraction. No equations, fitted parameters, or self-referential definitions are provided in the abstract or described components that would reduce claimed improvements to inputs by construction. Performance claims rest on empirical comparisons against human-designed and automated baselines rather than tautological mappings. The absence of formal algorithms or prompts for the mapping step is a clarity issue, not a circularity reduction. The derivation chain remains self-contained against external task results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Agent interaction histories provide rich concrete signals on both the causes of success or failure and the avenues for improvement.
Forward citations
Cited by 2 Pith papers
-
Deep Reasoning in General Purpose Agents via Structured Meta-Cognition
DOLORES, an agent using a formal language for meta-reasoning to construct adaptive scaffolds on the fly, outperforms prior scaffolding methods by 24.8% on average across four hard benchmarks and multiple model sizes.
-
Towards Direct Evaluation of Harness Optimizers via Priority Ranking
Priority ranking offers a low-cost direct evaluation for harness optimizers that correlates with their real multi-step optimization performance, supported by the Shor dataset of 182 scenarios.
Reference graph
Works this paper leans on
-
[1]
Meminsight: Autonomous memory augmentation for llm agents, 2025
Meminsight: Autonomous memory augmenta- tion for llm agents.arXiv preprint arXiv:2503.21760. Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettle- moyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools.Advances in Neural Information Pro- cessing Syste...
-
[2]
Appworld: A controllable world of apps and people for benchmarking interactive coding agents. arXiv preprint arXiv:2407.18901. Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Man- dlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and An- ima Anandkumar. 2023a. V oyager: An open-ended embodied agent with large language models.arXiv preprint arXiv:2305.16291. Hongru Wa...
-
[3]
Cycleresearcher: Improving automated research via automated review.arXiv preprint arXiv:2411.00816. Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yi- wen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, and 1 others. 2025. The rise and potential of large language model based agents: A survey.Science China Information Sci- ences, 68(2):1211...
-
[4]
Large language models as optimizers. In The Twelfth International Conference on Learning Representations. Jian Yang, Wei Zhang, Shark Liu, Jiajun Wu, Shawn Guo, and Yizhi Li. 2025. From code foundation mod- els to agents and applications: A practical guide to code intelligence.arXiv preprint arXiv:2511.18538. John Yang, Carlos E Jimenez, Alexander Wettig,...
-
[5]
abstracts agents into four interchangeable modules (planning, reasoning, tool use, memory), while AgentSwift (Li et al., 2025d) further enlarges the space by jointly searching workflow structure and functional components under a value-guided, uncertainty-aware hierarchical search. These search-based methods operate over increasingly rich design spaces but...
-
[6]
trains a workflow generator with a score- based preference objective, turning workflow opti- mization into learning from pairwise preferences induced by evaluation scores. RobustFlow (Xu et al., 2025) extends this view to robustness, opti- mizing generators so that workflows remain con- sistent across perturbed but semantically equiva- lent instructions. ...
work page 2025
-
[7]
and REVOLVE (Zhang et al., 2024b) use model-generated critiques and edits as optimization steps; similarly, ZERA (Yi et al., 2025) performs training-free evaluation–refinement with principle- based critiques and jointly refines system and user prompts (and task descriptions). For agentic set- tings with long and dynamic traces, SCOPE (Pei et al., 2025) tr...
work page 2025
-
[8]
distills trajectories into actionable rules, and Agent Workflow Memory (Wang et al., 2024d) stores workflow fragments that can be replayed for similar tasks. Memory evolution is also explored in strategic multi-agent settings, where self-play accumulates negotiation knowledge over time, e.g., Richelieu (Guan et al., 2024). Overall, these ap- proaches trea...
work page 2024
-
[9]
Review each proposal (summary + diff; open full files when needed)
-
[10]
Extract shared patterns and generalizable improvements
-
[11]
Resolve conflicts and synthesize a single unified scaffold. Where to Inspect Full Proposals •batch_modifications/<instance_id>/diff.txt •batch_modifications/<instance_id>/summary.md •batch_modifications/<instance_id>/scaffold.yaml Decision Guidelines (Prefer Success) • Successful instances:prioritize reusable tools, stable workflow improvements, and conci...
-
[12]
LOCATE: Find relevant files withfindandgrep
-
[13]
ANALYZE: Read the code
-
[14]
IMPLEMENT: Edit the files
-
[15]
VERIFY: Check if it works
-
[16]
SUBMIT:git add -A && git diff –cached && echo COMPLETE_TASK... Figure 13: The Minimal Seed Scaffold in Django. 22 Created System Template You are an expert software engineer solving GitHub issues in real open-source projects. ## Response Format (CRITICAL) You MUST respond with EXACTLY this format every turn: THOUGHT: <your analysis in a single paragraph> ...
-
[17]
What is the expected vs actual behavior?
UNDERSTAND: Read the issue carefully. What is the expected vs actual behavior?
-
[18]
LOCATE: Find relevant files using: •find /testbed -type f -name "*.py" | grep -E "keyword" | head -20 •grep -r "function_name" /testbed –include="*.py" -l | head -20 If grep output is truncated, NARROW your search to the relevant subdirectory: •grep -r "pattern" /testbed/specific/module/ –include="*.py" After finding one occurrence, check if the same patt...
-
[19]
ANALYZE: Read and understand the code: •cat /testbed/path/to/file.py | head -100 •grep -n "pattern" /testbed/path/to/file.py
-
[20]
IMPLEMENT: Make targeted changes using sed or python -c (NO heredoc!)
-
[21]
VERIFY: Check your changes: •git diff to see what you changed •Quick sanity check: python3 -c "import module_you_changed" to verify no syntax errors •TEST THE ACTUAL BEHAVIOR: Run a quick test with the specific inputs from the issue Example: python3 -c "from module import func; print(func(problematic_input))" •Test edge cases: boundary values, empty input...
-
[22]
VALIDATE: Before submitting, ask yourself: •Does my fix address the root cause described in the issue •Did I test the specific scenario mentioned in the issue? •Could my change break other functionality?
-
[23]
SUBMIT: When confident your fix is complete: git add -A && git diff –cached && echo COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT ## Important Reminders •Work incrementally: one change at a time •Read code before modifying it •Verify your changes with git diff before submitting •NEVER touch test files - your fix will be evaluated against hidden tests Figure 15: T...
work page 2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.