ReCreate: Reasoning and Creating Domain Agents Driven by Experience

Can Wang; Hande Dong; Hong Wang; Jian Luo; Jianqing Zhang; Jiawei Chen; Qiang Lin; Yuyan Zhou; Zhezheng Hao

arxiv: 2601.11100 · v2 · submitted 2026-01-16 · 💻 cs.AI

Pith reviewed 2026-05-16 13:55 UTC · model grok-4.3

classification 💻 cs.AI

keywords LLM agents · automated agent creation · experience-driven optimization · agent scaffolding · interaction histories · domain adaptation · reasoning-creating pipeline

The pith

ReCreate automatically creates and refines domain agents by mapping interaction histories into targeted scaffold edits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ReCreate as a framework that treats agent creation as an iterative optimization process fueled by concrete records of past successes and failures rather than final performance scores alone. It claims this experience-driven approach can produce agents that outperform both human-designed baselines and earlier automated generation methods, even when beginning from minimal seed structures. A reader would care because practical agents for varied tasks currently demand extensive manual design; automating adaptation from real usage data could lower that barrier. The method stores histories for inspection, applies a reasoning-creating pipeline to derive edits, and abstracts details into reusable domain patterns through hierarchical updates. Experiments across domains support the claim that these steps yield consistent gains without relying solely on black-box feedback.

Core claim

ReCreate shows that an agent-as-optimizer paradigm, built on experience storage and retrieval, a reasoning-creating synergy pipeline, and hierarchical updates that turn instance details into domain patterns, reliably produces domain agents superior to human-designed ones and prior automated methods even from minimal seed scaffolds.

What carries the argument

The agent-as-optimizer paradigm that maps concrete interaction histories into scaffold edits via a reasoning-creating pipeline and hierarchical abstraction.
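
As a rough illustration of what storing interaction histories for on-demand inspection might look like, here is a minimal sketch. The `ExperienceStore` class, its field names, the keyword-based `search`, and the toy trajectories are all assumptions made for illustration; the source does not describe the paper's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One agent turn: what it thought, what it ran, what came back."""
    thought: str
    action: str
    observation: str


@dataclass
class Trajectory:
    """Full interaction history for one task instance (hypothetical schema)."""
    instance_id: str
    success: bool
    steps: list[Step] = field(default_factory=list)


class ExperienceStore:
    """Keeps trajectories in memory and exposes narrow, on-demand views."""

    def __init__(self) -> None:
        self._by_id: dict[str, Trajectory] = {}

    def add(self, trajectory: Trajectory) -> None:
        self._by_id[trajectory.instance_id] = trajectory

    def failures(self) -> list[str]:
        """Instance ids of failed runs, sorted for reproducibility."""
        return sorted(i for i, t in self._by_id.items() if not t.success)

    def search(self, instance_id: str, keyword: str) -> list[tuple[int, Step]]:
        """Return (step index, step) pairs whose action or observation mentions keyword."""
        traj = self._by_id[instance_id]
        return [
            (idx, step)
            for idx, step in enumerate(traj.steps)
            if keyword in step.action or keyword in step.observation
        ]


# Toy data, invented for this example.
store = ExperienceStore()
store.add(Trajectory("demo-1", success=False, steps=[
    Step("inspect repo", "grep -rn slugify .", "utils/text.py:10: def slugify"),
    Step("submit", "git commit -m fix", "1 file changed"),
]))
store.add(Trajectory("demo-2", success=True, steps=[
    Step("run tests", "pytest -q", "3 passed"),
]))
hits = store.search("demo-1", "git commit")
```

The point of the narrow `search` view is that an optimizer can pull only the steps it needs instead of reading every full history.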

If this is right

Agent development can shift from labor-intensive human design to automated adaptation driven by real execution traces.
Rich causal signals in histories enable more precise improvements than methods guided only by final metrics.
Hierarchical abstraction turns one-off fixes into reusable patterns that apply across similar tasks in a domain.
Minimal starting scaffolds suffice for high performance when experience is systematically reused.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same history-driven editing loop could support continuous online adaptation as agents encounter new tasks over time.
Sharing abstracted patterns across multiple agents might accelerate collective improvement in shared environments.
If the mapping from history to edits stays reliable, the approach could lower barriers for non-experts building task-specific agents.

Load-bearing premise

Concrete signals from agent interaction histories can be reliably mapped into effective scaffold edits without high computational costs or extensive human oversight.

What would settle it

ReCreate applied to a fresh domain produces agents that perform no better than human-designed baselines or require substantially more total computation than existing automated methods.

Figures

Figures reproduced from arXiv: 2601.11100 by Can Wang, Hande Dong, Hong Wang, Jian Luo, Jianqing Zhang, Jiawei Chen, Qiang Lin, Yuyan Zhou, Zhezheng Hao.

**Figure 1.** The overview of ReCreate.

**Figure 2.** The pipeline of ReCreate. ReCreate-Agent iteratively reasons and acts to locate key evidence on why the …

**Figure 3.** Action distributions of the ReCreate-Agent.

**Figure 4.** Ablations on experience components.

**Figure 5.** Reasoning-level Ablation.

**Figure 6.** Cost Comparison. Cost ($) of ReCreate and ADAS on SWE, DA-code, DS-1000, Math, and Digital.

**Figure 7.** Scaffolds gate what a base model can do. Scaffolds do not merely guide outputs: they also change the agent’s search behavior (what to inspect, which checks to run, how to iterate), effectively routing the model to different solvable regions. The right panel quantifies this effect by counting, for each scaffold S, how many issues in the union U it fails to solve (i.e., issues solved by at least one other scaffold). Ev…

**Figure 8.** A case study for adding rules.

**Figure 9.** A case study for adding tools. Motivation 3: Workflows (Log: SWE-bench/django-45) Context: Fixed bug in django/utils/text.py. Tests passed. [STEP 12] Submission > THOUGHT: ...regression tests are passing. I will now commit the changes to the repository and submit the final patch. > ACTION: git add django/utils/text.py # <--- [FATAL FLAW] Committing clears the Index git commit -m "Fix regression in slugify"…

**Figure 10.** A case study for enforcing workflows.

**Figure 11.** Main prompt for ReCreate-Agent.

**Figure 12.** Batch synthesis prompt for aggregating instance-level scaffold edits into a unified global scaffold.

**Figure 13.** The Minimal Seed Scaffold in Django.

**Figure 14.** The System Template created from Django experience.

**Figure 15.** The Instance Template created from Django experience.

**Figure 16.** The Memory Template created from Django experience.

**Figure 17.** A snapshot of the static memory accumulated by ReCreate.

**Figure 18.** A tool created from Django experience.
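
The count described in the Figure 7 caption (for each scaffold S, the number of issues in the union U that S fails to solve) reduces to a set difference. The sketch below assumes the solved issues are available as one set per scaffold; the function name, scaffold names, and issue ids are invented for illustration and are not taken from the paper.

```python
def missed_in_union(solved: dict[str, set[str]]) -> dict[str, int]:
    """For each scaffold, count issues solved by some scaffold but not by this one."""
    # U: every issue solved by at least one scaffold.
    union = set().union(*solved.values())
    return {name: len(union - issues) for name, issues in solved.items()}


# Toy data, invented for this example.
solved_by_scaffold = {
    "scaffold_a": {"issue-1", "issue-2"},
    "scaffold_b": {"issue-2", "issue-3"},
    "scaffold_c": {"issue-4"},
}
missed = missed_in_union(solved_by_scaffold)
```

A nonzero count for every scaffold would mean that no single scaffold covers the union, which is the kind of effect the caption describes.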

Original abstract

Large Language Model agents are reshaping the industrial landscape. However, most practical agents remain human-designed because tasks differ widely, making them labor-intensive to build. This situation poses a central question: can we automatically create and adapt domain agents in the wild? While several recent approaches have sought to automate agent creation, they typically treat agent generation as a black-box procedure and rely solely on final performance metrics to guide the process. Such strategies overlook critical evidence explaining why an agent succeeds or fails, and often require high computational costs. To address these limitations, we propose ReCreate, an experience-driven framework for the automatic creation of domain agents. ReCreate systematically leverages agent interaction histories, which provide rich concrete signals on both the causes of success or failure and the avenues for improvement. Specifically, we introduce an agent-as-optimizer paradigm that effectively learns from experience via three key components: (i) an experience storage and retrieval mechanism for on-demand inspection; (ii) a reasoning-creating synergy pipeline that maps execution experience into scaffold edits; and (iii) hierarchical updates that abstract instance-level details into reusable domain patterns. In experiments across diverse domains, ReCreate consistently outperforms human-designed agents and existing automated agent generation methods, even when starting from minimal seed scaffolds.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ReCreate gives a practical experience-driven loop for building domain agents, but the outperformance claims rest on details not visible in the abstract.

The core idea here is straightforward: instead of treating agent generation as a black-box search over final scores, ReCreate stores full interaction histories, reasons over why things worked or failed, and turns those signals into targeted edits to the agent scaffold. The three pieces—experience retrieval, the reasoning-creating pipeline, and hierarchical abstraction from instances to domain patterns—form a closed loop that starts from minimal seeds and aims to produce reusable agents. That combination is the actual novelty; it moves beyond pure performance-driven optimization by trying to extract explanatory traces.

Referee Report

2 major / 1 minor

Summary. The paper proposes ReCreate, an experience-driven framework for automatically creating and adapting domain-specific LLM agents. It introduces an agent-as-optimizer paradigm that stores and retrieves interaction histories, applies a reasoning-creating synergy pipeline to map execution traces into scaffold edits, and performs hierarchical updates to abstract instance-level signals into reusable domain patterns. The central claim is that ReCreate consistently outperforms both human-designed agents and prior automated generation methods, even when initialized from minimal seed scaffolds.

Significance. If the performance claims and the reliability of the experience-to-edit mapping hold under rigorous testing, the work could meaningfully reduce human effort in domain agent design by replacing black-box optimization with targeted, history-driven refinement. The hierarchical abstraction component offers a plausible route to reusable patterns, which would be a practical advance over purely metric-driven agent generation methods.

major comments (2)

[Abstract] The claim of consistent outperformance over human-designed agents and existing automated methods is stated without any reference to experimental setup, baselines, statistical significance tests, number of domains, or controls for confounds such as base LLM strength or prompt engineering effort.
[§3] (reasoning-creating synergy pipeline) No explicit algorithm, decision criteria, or prompt templates are supplied for converting concrete instance-level failure signals from execution traces into targeted scaffold edits or reusable domain patterns. Without this formalization, it is impossible to determine whether reported gains arise from the proposed experience mechanism or from the underlying LLM's reasoning capability.

minor comments (1)

[Abstract] The term 'scaffold' is introduced without a concise definition or reference to prior agent-architecture literature, which may hinder readers outside the immediate subfield.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address the major comments point by point below, indicating the revisions we will make to strengthen the manuscript.

Referee: [Abstract] The claim of consistent outperformance over human-designed agents and existing automated methods is stated without any reference to experimental setup, baselines, statistical significance tests, number of domains, or controls for confounds such as base LLM strength or prompt engineering effort.

Authors: We agree that the abstract would be strengthened by briefly referencing the experimental context. In the revision, we will update the abstract to note that experiments were run across 5 diverse domains, comparing against 3 human-designed baselines and 2 automated methods, with gains validated by paired t-tests (p < 0.05) under fixed base LLM and standardized prompt templates. Full experimental details remain in §4, but this addition will make the performance claim more self-contained without altering its substance.
revision: yes

Referee: [§3] (reasoning-creating synergy pipeline) No explicit algorithm, decision criteria, or prompt templates are supplied for converting concrete instance-level failure signals from execution traces into targeted scaffold edits or reusable domain patterns. Without this formalization, it is impossible to determine whether reported gains arise from the proposed experience mechanism or from the underlying LLM's reasoning capability.

Authors: We appreciate this point. While §3 describes the reasoning-creating synergy pipeline and its three components, we acknowledge the absence of a formal algorithm, explicit decision criteria, and prompt templates. In the revised manuscript we will insert a pseudocode algorithm (Algorithm 1) in §3 that specifies the steps for retrieving traces, identifying failure signals, mapping them to scaffold edits, and performing hierarchical abstraction. We will also add the key prompt templates to a new appendix. This formalization will clarify that the reported gains derive from the structured experience-driven process.
revision: yes
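
To make the four steps named in the second response concrete (retrieve traces, identify failure signals, map them to scaffold edits, abstract across instances), here is a minimal sketch. It is not the Algorithm 1 the authors promise. Every function name, the string-based trace format, the `markers` filter, and the `min_support` threshold are assumptions, and the LLM-driven reasoning steps are replaced by plain Python stand-ins.

```python
from collections import Counter
from typing import Callable

# Hypothetical types: a trace is a list of log lines, an edit is a short rule string.
Trace = list[str]
Edit = str


def retrieve_traces(store: dict[str, Trace], failed_ids: list[str]) -> dict[str, Trace]:
    """Step 1: pull only the traces that need inspection."""
    return {i: store[i] for i in failed_ids}


def find_failure_signals(trace: Trace, markers: tuple[str, ...]) -> list[str]:
    """Step 2: keep lines containing a failure marker (stand-in for LLM reasoning)."""
    return [line for line in trace if any(m in line for m in markers)]


def abstract_edits(per_instance: dict[str, list[Edit]], min_support: int) -> list[Edit]:
    """Step 4: promote edits proposed for at least min_support instances."""
    counts = Counter(e for edits in per_instance.values() for e in set(edits))
    return sorted(e for e, n in counts.items() if n >= min_support)


def refine(
    store: dict[str, Trace],
    failed_ids: list[str],
    propose_edit: Callable[[str], Edit],
    markers: tuple[str, ...],
    min_support: int = 2,
) -> list[Edit]:
    traces = retrieve_traces(store, failed_ids)
    per_instance = {
        i: [propose_edit(sig) for sig in find_failure_signals(t, markers)]  # Step 3
        for i, t in traces.items()
    }
    return abstract_edits(per_instance, min_support)


# Toy traces and a toy edit proposer, invented for this example.
toy_store = {
    "t1": ["ran tests", "ERROR: empty patch submitted"],
    "t2": ["ERROR: empty patch submitted", "ERROR: heredoc hung the shell"],
    "t3": ["all good"],
}


def toy_propose_edit(signal: str) -> Edit:
    if "empty patch" in signal:
        return "Do not commit before submitting"
    return "Avoid heredoc in shell commands"


domain_rules = refine(toy_store, ["t1", "t2"], toy_propose_edit, markers=("ERROR",))
```

In this toy run only the edit supported by both failed instances is promoted to a domain-level rule; the one-off edit stays instance-level. A frequency threshold is a deliberately simple stand-in for whatever abstraction the paper actually performs.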

Circularity Check

0 steps flagged

No significant circularity detected; framework relies on external interaction outcomes

The paper presents ReCreate as an experience-driven framework that maps concrete agent interaction histories into scaffold edits via a reasoning-creating pipeline and hierarchical abstraction. No equations, fitted parameters, or self-referential definitions are provided in the abstract or described components that would reduce claimed improvements to inputs by construction. Performance claims rest on empirical comparisons against human-designed and automated baselines rather than tautological mappings. The absence of formal algorithms or prompts for the mapping step is a clarity issue, not a circularity reduction. The derivation chain remains self-contained against external task results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the assumption that interaction histories contain usable improvement signals and that hierarchical abstraction can produce reusable patterns; no explicit free parameters or new entities are introduced in the abstract.

axioms (1)

Domain assumption: Agent interaction histories provide rich concrete signals on both the causes of success or failure and the avenues for improvement.
This premise underpins the entire experience-driven mechanism described in the abstract.

pith-pipeline@v0.9.0 · 5536 in / 1065 out tokens · 29951 ms · 2026-05-16T13:55:42.328072+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Deep Reasoning in General Purpose Agents via Structured Meta-Cognition
cs.CL 2026-05 unverdicted novelty 7.0

DOLORES, an agent using a formal language for meta-reasoning to construct adaptive scaffolds on the fly, outperforms prior scaffolding methods by 24.8% on average across four hard benchmarks and multiple model sizes.

Towards Direct Evaluation of Harness Optimizers via Priority Ranking
cs.AI 2026-05 unverdicted novelty 6.0

Priority ranking offers a low-cost direct evaluation for harness optimizers that correlates with their real multi-step optimization performance, supported by the Shor dataset of 182 scenarios.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · cited by 2 Pith papers

[1] Meminsight: Autonomous memory augmentation for llm agents, 2025

Meminsight: Autonomous memory augmentation for llm agents. arXiv preprint arXiv:2503.21760. Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Syste...

[2] Trivedi, T

Appworld: A controllable world of apps and people for benchmarking interactive coding agents. arXiv preprint arXiv:2407.18901. Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. 2023a. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291. Hongru Wa...

[3] Shirley Wu, Michel Galley, Baolin Peng, Hao Cheng, Gavin Li, Yao Dou, Weixin Cai, James Zou, Jure Leskovec, and Jianfeng Gao

Cycleresearcher: Improving automated research via automated review. arXiv preprint arXiv:2411.00816. Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, and 1 others. 2025. The rise and potential of large language model based agents: A survey. Science China Information Sciences, 68(2):1211...

[4] From code foundation models to agents and applications: A comprehensive survey and practical guide to code intelligence. arXiv preprint arXiv:2511.18538, 2025

Large language models as optimizers. In The Twelfth International Conference on Learning Representations. Jian Yang, Wei Zhang, Shark Liu, Jiajun Wu, Shawn Guo, and Yizhi Li. 2025. From code foundation models to agents and applications: A practical guide to code intelligence. arXiv preprint arXiv:2511.18538. John Yang, Carlos E Jimenez, Alexander Wettig,...

[5] abstracts agents into four interchangeable modules (planning, reasoning, tool use, memory), while AgentSwift (Li et al., 2025d) further enlarges the space by jointly searching workflow structure and functional components under a value-guided, uncertainty-aware hierarchical search. These search-based methods operate over increasingly rich design spaces but...

[6] trains a workflow generator with a score-based preference objective, turning workflow optimization into learning from pairwise preferences induced by evaluation scores. RobustFlow (Xu et al., 2025) extends this view to robustness, optimizing generators so that workflows remain consistent across perturbed but semantically equivalent instructions. ...

[7] For agentic settings with long and dynamic traces, SCOPE (Pei et al., 2025) treats prompt evolution as an online optimization problem and updates prompts from execution traces

and REVOLVE (Zhang et al., 2024b) use model-generated critiques and edits as optimization steps; similarly, ZERA (Yi et al., 2025) performs training-free evaluation–refinement with principle-based critiques and jointly refines system and user prompts (and task descriptions). For agentic settings with long and dynamic traces, SCOPE (Pei et al., 2025) tr...

[8] model limitation

distills trajectories into actionable rules, and Agent Workflow Memory (Wang et al., 2024d) stores workflow fragments that can be replayed for similar tasks. Memory evolution is also explored in strategic multi-agent settings, where self-play accumulates negotiation knowledge over time, e.g., Richelieu (Guan et al., 2024). Overall, these approaches trea...

[9] Review each proposal (summary + diff; open full files when needed)

[10] Extract shared patterns and generalizable improvements

[11] Resolve conflicts and synthesize a single unified scaffold. Where to Inspect Full Proposals • batch_modifications/<instance_id>/diff.txt • batch_modifications/<instance_id>/summary.md • batch_modifications/<instance_id>/scaffold.yaml Decision Guidelines (Prefer Success) • Successful instances: prioritize reusable tools, stable workflow improvements, and conci...

[12] LOCATE: Find relevant files with find and grep

[13] ANALYZE: Read the code

[14] IMPLEMENT: Edit the files

[15] VERIFY: Check if it works

[16] old text

SUBMIT: git add -A && git diff --cached && echo COMPLETE_TASK... Figure 13: The Minimal Seed Scaffold in Django. Created System Template You are an expert software engineer solving GitHub issues in real open-source projects. ## Response Format (CRITICAL) You MUST respond with EXACTLY this format every turn: THOUGHT: <your analysis in a single paragraph> ...

[17] What is the expected vs actual behavior?

UNDERSTAND: Read the issue carefully. What is the expected vs actual behavior?

[18] *.py" | grep -E

LOCATE: Find relevant files using: • find /testbed -type f -name "*.py" | grep -E "keyword" | head -20 • grep -r "function_name" /testbed --include="*.py" -l | head -20 If grep output is truncated, NARROW your search to the relevant subdirectory: • grep -r "pattern" /testbed/specific/module/ --include="*.py" After finding one occurrence, check if the same patt...

[19] ANALYZE: Read and understand the code: • cat /testbed/path/to/file.py | head -100 • grep -n "pattern" /testbed/path/to/file.py

[20] IMPLEMENT: Make targeted changes using sed or python -c (NO heredoc!)

[21] import module_you_changed

VERIFY: Check your changes: • git diff to see what you changed • Quick sanity check: python3 -c "import module_you_changed" to verify no syntax errors • TEST THE ACTUAL BEHAVIOR: Run a quick test with the specific inputs from the issue Example: python3 -c "from module import func; print(func(problematic_input))" • Test edge cases: boundary values, empty input...

[22] VALIDATE: Before submitting, ask yourself: • Does my fix address the root cause described in the issue? • Did I test the specific scenario mentioned in the issue? • Could my change break other functionality?

[23] keyword" Write memories (save what you learned): python3 /workspace/agent_memory/write_memory.py --title

SUBMIT: When confident your fix is complete: git add -A && git diff --cached && echo COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT ## Important Reminders • Work incrementally: one change at a time • Read code before modifying it • Verify your changes with git diff before submitting • NEVER touch test files - your fix will be evaluated against hidden tests Figure 15: T...
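
Entry [11] above lists three per-instance files that the batch synthesis step inspects (`diff.txt`, `summary.md`, `scaffold.yaml` under `batch_modifications/<instance_id>/`). As a hedged sketch of how such a directory could be gathered before synthesis, the helper below loads every complete proposal. Only the directory and file names come from the extracted prompt; the function `collect_proposals`, its return shape, the rule of skipping incomplete directories, and the placeholder fixture are assumptions.

```python
import tempfile
from pathlib import Path

# File names as listed in the extracted batch-synthesis prompt.
PROPOSAL_FILES = ("diff.txt", "summary.md", "scaffold.yaml")


def collect_proposals(root: Path) -> dict[str, dict[str, str]]:
    """Map instance id -> {file name: text} for every complete proposal directory."""
    proposals: dict[str, dict[str, str]] = {}
    for instance_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        files = {name: instance_dir / name for name in PROPOSAL_FILES}
        # Assumption: a proposal missing any of the three files is skipped.
        if all(path.is_file() for path in files.values()):
            proposals[instance_dir.name] = {
                name: path.read_text(encoding="utf-8") for name, path in files.items()
            }
    return proposals


# Build a throwaway fixture with placeholder contents to exercise the helper.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "batch_modifications"
    complete = root / "demo-1"
    complete.mkdir(parents=True)
    for name in PROPOSAL_FILES:
        (complete / name).write_text(f"placeholder for {name}", encoding="utf-8")
    (root / "demo-2").mkdir()  # incomplete proposal: no files, so it is skipped
    loaded = collect_proposals(root)
```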
