pith. sign in

arxiv: 2606.09498 · v1 · pith:B3RCWUR7new · submitted 2026-06-08 · 💻 cs.CL

Self-Harness: Harnesses That Improve Themselves

Pith reviewed 2026-06-27 16:24 UTC · model grok-4.3

classification 💻 cs.CL
keywords self-improving agentsLLM harness designmodel-specific adaptationiterative refinementagent optimizationfailure pattern miningregression testing for agents
0
0 comments X

The pith

An LLM-based agent can improve its own harness by mining model-specific failures from traces and proposing minimal edits that pass regression tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that different base models need different harnesses to interact effectively with environments, yet harnesses are usually written by humans. Self-Harness replaces that step with an automated loop in which the agent itself extracts failure patterns, suggests small targeted changes to the harness, and only keeps changes that improve performance on held-out tests. On Terminal-Bench-2.0 the method raised pass rates for three unrelated models by roughly 15 to 21 points. The central demonstration is that the same model can both operate inside a harness and edit that harness without external supervision or stronger models.

Core claim

Self-Harness is realized as a three-stage loop—Weakness Mining extracts recurring failure patterns from execution traces, Harness Proposal generates diverse but minimal edits tied to those patterns, and Proposal Validation accepts an edit only after it passes regression testing on held-out tasks. When run on Terminal-Bench-2.0 starting from a minimal harness, the loop raised held-out pass rates from 40.5 % to 61.9 % for MiniMax M2.5, from 23.8 % to 38.1 % for Qwen3.5-35B-A3B, and from 42.9 % to 57.1 % for GLM-5. The resulting harnesses contain concrete, executable changes rather than generic instructions, and the gains are specific to each model’s observed weaknesses.

What carries the argument

The Self-Harness iterative loop of Weakness Mining, Harness Proposal, and Proposal Validation that turns execution traces into model-specific harness edits.

If this is right

  • Harness design becomes model-specific and automatic rather than manually engineered once per model family.
  • Performance gains appear without access to stronger external agents or human-written instructions beyond the initial minimal harness.
  • Qualitative inspection shows edits address concrete failure modes instead of adding blanket rules.
  • The same loop can be restarted on any new base model using only its own traces.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could let newly released models begin improving their own harnesses within hours of deployment rather than waiting for expert updates.
  • If the mining step generalizes, similar loops might be applied to other agent components such as tool-use formats or memory structures.
  • Repeated iterations might produce harnesses that become increasingly specialized to a single model’s error distribution.

Load-bearing premise

The LLM can accurately extract model-specific failure patterns from traces and generate minimal, non-regressive harness edits that generalize beyond the traces used for mining.

What would settle it

Applying the final harnesses produced by Self-Harness to a fresh set of tasks drawn from the same benchmark distribution yields no net gain or produces regressions on more than half the tasks.

Figures

Figures reproduced from arXiv: 2606.09498 by Chen Zhang, Hangfan Zhang, Kangcong Li, Lei Bai, Shao Zhang, Shuyue Hu, Yang Chen, Yiqun Zhang.

Figure 1
Figure 1. Figure 1: Three paradigms of harness improvement. In human harness engineering, human engineers [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of one Self-Harness optimization loop. The current harness [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Initial harness and editable interface used as the starting point for Self-Harness. The harness [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Pass rates (%) on Terminal-Bench-2.0 across MiniMax M2.5, Qwen3.5-35B-A3B, and [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: MiniMax M2.5 Self-Harness run. Panel (a) summarizes the Self-Harness evolution [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Qwen3.5 Self-Harness run. Panel (a) summarizes the Self-Harness evolution trajectory, [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Case study of a MiniMax M2.5 harness edit on the Terminal-Bench-2.0 [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Case study of a Qwen3.5 harness edit on the Terminal-Bench-2.0 [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Case study of a GLM-5 harness edit on the Terminal-Bench-2.0 [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: GLM-5 Self-Harness run. Panel (a) summarizes the Self-Harness evolution trajectory, [PITH_FULL_IMAGE:figures/full_fig_p019_10.png] view at source ↗
read the original abstract

The performance of LLM-based agents is jointly shaped by their base models and the harnesses that mediate their interaction with the environment. Because different models exhibit distinct behaviors, effective harness design is inherently model-specific. Yet agent harnesses are still largely engineered by human experts, a paradigm that scales poorly as modern LLMs become increasingly diverse and rapidly evolving. In this paper, we introduce Self-Harness, a new paradigm in which an LLM-based agent improves its own operating harness, without relying on human engineers or stronger external agents. We operationalize Self-Harness as an iterative loop with three stages: Weakness Mining, which identifies model-specific failure patterns from execution traces; Harness Proposal, which generates diverse yet minimal harness modifications tied to these failures; and Proposal Validation, which accepts candidate edits only after regression testing. We instantiate Self-Harness on Terminal-Bench-2.0 using a minimal initial harness and three base models from diverse families: MiniMax M2.5, Qwen3.5-35B-A3B, and GLM-5. Across all three models, Self-Harness consistently improves performance, with held-out pass rates increasing from 40.5% to 61.9%, 23.8% to 38.1%, and 42.9% to 57.1%, respectively. Qualitative analyses further show that Self-Harness does not simply add generic instructions, but effectively turns model-specific weaknesses into concrete, executable harness changes. These results suggest a path toward LLM-based agents that are not merely shaped by their harnesses, but can also participate in reshaping them.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Self-Harness, an iterative self-improvement loop for LLM agents consisting of Weakness Mining (extracting model-specific failure patterns from traces), Harness Proposal (generating minimal edits), and Proposal Validation (regression testing). Experiments on Terminal-Bench-2.0 with three base models (MiniMax M2.5, Qwen3.5-35B-A3B, GLM-5) report held-out pass-rate gains from 40.5% to 61.9%, 23.8% to 38.1%, and 42.9% to 57.1%, with qualitative claims that edits are model-specific rather than generic.

Significance. If the generalization mechanism holds, the work offers a concrete path to model-specific harness adaptation without external supervision or human engineers, addressing a scaling bottleneck as LLMs diversify. The use of an external validation stage and held-out evaluation is a positive design choice that reduces obvious circularity risks.

major comments (3)
  1. [Abstract, §4] Abstract and §4 (Experiments): The headline gains are reported without any quantitative details on iteration count, total traces mined, edit diff sizes, or failure-pattern precision/recall. This makes it impossible to assess whether the held-out lift is driven by the claimed mechanism or by incidental effects of the iterative loop.
  2. [§3.2, §4.3] §3.2 (Harness Proposal) and §4.3 (Qualitative analysis): The claim that edits are 'minimal' and 'non-regressive' and generalize beyond mining traces is central to the contribution, yet no ablation against generic prompt lengthening or random edits is provided, nor is there a breakdown of how many proposals were rejected by Proposal Validation.
  3. [§4.1] §4.1 (Setup): The construction of the mining set versus the held-out set is not described with sufficient precision to rule out distributional overlap or trace leakage; regression testing alone does not guarantee transfer if the validation slice shares task characteristics with the mining data.
minor comments (2)
  1. [§3] Notation for the three stages is introduced without a compact diagram or pseudocode listing the exact inputs/outputs of each stage.
  2. [§4] Table 1 (or equivalent results table) should report per-model iteration counts and number of accepted edits to allow readers to gauge computational cost.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. We address each major comment below and indicate where revisions to the manuscript are planned.

read point-by-point responses
  1. Referee: [Abstract, §4] Abstract and §4 (Experiments): The headline gains are reported without any quantitative details on iteration count, total traces mined, edit diff sizes, or failure-pattern precision/recall. This makes it impossible to assess whether the held-out lift is driven by the claimed mechanism or by incidental effects of the iterative loop.

    Authors: We agree that the current presentation in the abstract and §4 omits these operational details, which limits interpretability of the results. In the revised manuscript we will expand §4 with a dedicated subsection (and corresponding table) that reports iteration counts per model, total traces processed during mining, average token-level edit sizes, and the precision/recall of the weakness-mining stage as measured against human-annotated failure patterns. These additions will make the scale and selectivity of the loop explicit without altering any performance numbers. revision: yes

  2. Referee: [§3.2, §4.3] §3.2 (Harness Proposal) and §4.3 (Qualitative analysis): The claim that edits are 'minimal' and 'non-regressive' and generalize beyond mining traces is central to the contribution, yet no ablation against generic prompt lengthening or random edits is provided, nor is there a breakdown of how many proposals were rejected by Proposal Validation.

    Authors: The manuscript currently supports minimality and non-regressiveness only through the design of Proposal Validation (regression testing on held-out tasks) and the qualitative examples in §4.3; no quantitative ablation or rejection-rate statistics are reported. We therefore plan a partial revision: we will add (i) a new paragraph in §4.3 reporting the fraction of proposals rejected by validation and (ii) a concise ablation subsection comparing our method against both random edits and generic prompt lengthening on the same held-out set. We maintain that the existing qualitative evidence already indicates model-specific rather than generic changes, but the requested ablation will strengthen that claim. revision: partial

  3. Referee: [§4.1] §4.1 (Setup): The construction of the mining set versus the held-out set is not described with sufficient precision to rule out distributional overlap or trace leakage; regression testing alone does not guarantee transfer if the validation slice shares task characteristics with the mining data.

    Authors: We acknowledge that §4.1 currently gives only a high-level statement of the split. In the revision we will replace this with an explicit description: tasks were partitioned by category (terminal command families) into a 60 % mining pool and a 40 % strictly held-out pool before any traces were collected; mining traces were generated exclusively on the mining pool; and the validation stage used only the held-out pool. This protocol eliminates trace leakage and reduces distributional overlap by construction. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical gains rest on held-out validation

full rationale

The paper presents an iterative empirical procedure (Weakness Mining → Harness Proposal → Proposal Validation via regression testing) whose central claims are performance lifts measured on explicitly held-out tasks. No equations, fitted parameters renamed as predictions, self-citation chains, or uniqueness theorems appear in the provided text. The validation stage is described as an external check independent of the mining traces, rendering the reported improvements (40.5%→61.9%, etc.) falsifiable rather than tautological. This is the normal case of a self-contained experimental method.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are specified in the provided text.

pith-pipeline@v0.9.1-grok · 5843 in / 1056 out tokens · 26637 ms · 2026-06-27T16:24:11.277956+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

37 extracted references · 2 canonical work pages

  1. [1]

    Catarena: Evaluation of llm agents through iterative tournament competitions

    Lingyue Fu, Xin Ding, Linyue Pan, Yaoming Zhu, Shao Zhang, Lin Qiu, Weiwen Liu, Weinan Zhang, Xuezhi Cao, Xunliang Cai, Jiaxin Ding, and Yong Yu. Catarena: Evaluation of llm agents through iterative tournament competitions. InInternational Conference on Machine Learning. PMLR, 2026

  2. [2]

    Glm-5: from vibe coding to agentic engineering, 2026

    GLM-5 Team. Glm-5: from vibe coding to agentic engineering, 2026. URL https://arxiv. org/abs/2602.15763

  3. [3]

    Automated design of agentic systems, 2025

    Shengran Hu, Cong Lu, and Jeff Clune. Automated design of agentic systems, 2025. URL https://arxiv.org/abs/2408.08435

  4. [4]

    Deepagents, 2026

    LangChain. Deepagents, 2026. URL https://github.com/langchain-ai/deepagents. Software framework

  5. [5]

    Meta-harness: End-to-end optimization of model harnesses, 2026

    Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn. Meta-harness: End-to-end optimization of model harnesses, 2026. URL https://arxiv.org/ abs/2603.28052

  6. [6]

    Retrieval-augmented generation for knowledge-intensive nlp tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Kuttler, Mike Lewis, Wen-tau Yih, Tim Rocktaschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks. InAdvances in Neural Information Processing Systems, 2020. URL https://arxiv.org/abs/2005.11401

  7. [7]

    Anticipatory planning for multimodal ai agents

    Yongyuan Liang, Shijie Zhou, Yu Gu, Hao Tan, Gang Wu, Franck Dernoncourt, Jihyung Kil, Ryan A Rossi, and Ruiyi Zhang. Anticipatory planning for multimodal ai agents. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5925–5935, 2026

  8. [8]

    Agentic harness engineering: Observability-driven automatic evolution of coding-agent harnesses, 2026

    Jiahang Lin, Shichun Liu, Chengjun Pan, Lizhi Lin, Shihan Dou, Zhiheng Xi, Xuanjing Huang, Hang Yan, Zhenhua Han, Tao Gui, and Yu-Gang Jiang. Agentic harness engineering: Observability-driven automatic evolution of coding-agent harnesses, 2026. URL https:// arxiv.org/abs/2604.25850

  9. [9]

    Dive into claude code: The design space of today’s and future ai agent systems, 2026

    Jiacheng Liu, Xiaohan Zhao, Xinyi Shang, and Zhiqiang Shen. Dive into claude code: The design space of today’s and future ai agent systems, 2026. URLhttps://arxiv.org/abs/ 2604.14228

  10. [10]

    Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing, 2021

    Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing, 2021. URLhttps://arxiv.org/abs/2107.13586. 14

  11. [11]

    The ai scientist: Towards fully automated open-ended scientific discovery, 2024

    Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist: Towards fully automated open-ended scientific discovery, 2024. URL https: //arxiv.org/abs/2408.06292

  12. [12]

    A survey of context engineering for large language models, 2025

    Lingrui Mei, Jiayu Yao, Yuyao Ge, Yiwei Wang, Baolong Bi, Yujun Cai, Jiazhi Liu, Mingyu Li, Zhong-Zhi Li, Duzhen Zhang, Chenlin Zhou, Jiayi Mao, Tianze Xia, Jiafeng Guo, and Shenghua Liu. A survey of context engineering for large language models, 2025. URL https://arxiv.org/abs/2507.13334

  13. [13]

    Merrill, Alexander G

    Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E. Kelly Buchanan, Junhong Shen, Guanghao Ye, Haowei Lin, Jason Poulos, Maoyu Wang, Marianna Nezhurina, Jenia Jitsev, Di Lu, Orfeas Menis Mastromichalakis, Zhiwei Xu, Zizhao Chen, Yue Liu, Robert Zhang, Leon Liangyu Chen, An...

  14. [14]

    Minimax m2.5: Built for real-world productivity, February 2026

    MiniMax. Minimax m2.5: Built for real-world productivity, February 2026. URL https: //www.minimax.io/news/minimax-m25. Official model report

  15. [15]

    Alexander Novikov, Ngan Vu, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. Alphaevolve: A coding agent for scientific and algori...

  16. [16]

    Codex, 2026

    OpenAI. Codex, 2026. URLhttps://openai.com/codex/. Product page

  17. [17]

    Patil, Ion Stoica, and Joseph E

    Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. Memgpt: Towards llms as operating systems, 2024. URL https: //arxiv.org/abs/2310.08560

  18. [18]

    Toolllm: Facilitating large language models to master 16000+ real-world apis, 2023

    Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. Toolllm: Facilitating large language models to master 16000+ real-world apis, 2023. URLhttps://arxiv.org/abs/2307.16789

  19. [19]

    Alita: Generalist agent enabling scalable agentic reasoning with minimal predefinition and maximal self-evolution, 2025

    Jiahao Qiu, Xuan Qi, Tongcheng Zhang, Xinzhe Juan, Jiacheng Guo, Yifu Lu, Yimin Wang, Zixin Yao, Qihan Ren, Xun Jiang, Xing Zhou, Dongrui Liu, Ling Yang, Yue Wu, Kaixuan Huang, Shilong Liu, Hongru Wang, and Mengdi Wang. Alita: Generalist agent enabling scalable agentic reasoning with minimal predefinition and maximal self-evolution, 2025. URL https://arxi...

  20. [20]

    Qwen3.5: Towards native multimodal agents, February 2026

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https: //qwen.ai/blog?id=qwen3.5. Official model report and model card for Qwen3.5-35B-A3B

  21. [21]

    Rogers, Inna Goncearenco, Giuseppe Sarli, Igor Galynker, Denis Peskoff, Marine Carpuat, Jules White, Shyamal Anadkat, Alexander Hoyle, and Philip Resnik

    Sander Schulhoff, Michael Ilie, Nishant Balepur, Konstantine Kahadze, Amanda Liu, Chenglei Si, Yinheng Li, Aayush Gupta, HyoJung Han, Sevien Schulhoff, Pranav Sandeep Dulepet, Saurav Vidyadhara, Dayeon Ki, Sweta Agrawal, Chau Pham, Gerson Kroiz, Feileen Li, Hudson Tao, Ashay Srivastava, Hevander Da Costa, Saloni Gupta, Megan L. Rogers, Inna Goncearenco, G...

  22. [22]

    Quantifying language models’ sensitivity to spurious features in prompt design or: How i learned to start worrying about prompt formatting, 2024

    Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. Quantifying language models’ sensitivity to spurious features in prompt design or: How i learned to start worrying about prompt formatting, 2024. URLhttps://arxiv.org/abs/2310.11324

  23. [23]

    Reflexion: Language agents with verbal reinforcement learning, 2023

    Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning, 2023. URL https://arxiv.org/abs/2303.11366

  24. [24]

    Openhands: An open platform for ai software developers as generalist agents

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. Openhands: An open platform for ai software developers as generalist agents. InInternational Conference on Learning Representations, volume 2025, pages 65882–65919, 2025

  25. [25]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. InAdvances in Neural Information Processing Systems, 2022. URL https://arxiv. org/abs/2201.11903

  26. [26]

    Controllable and verifiable tool-use data synthesis for agentic reinforcement learning.arXiv preprint arXiv:2604.09813, 2026

    Siyuan Xu, Shiyang Li, Xin Liu, Tianyi Liu, Yixiao Li, Zhan Shi, Zixuan Zhang, Zilong Wang, Qingyu Yin, Jianshu Chen, et al. Controllable and verifiable tool-use data synthesis for agentic reinforcement learning.arXiv preprint arXiv:2604.09813, 2026

  27. [27]

    The ai scientist-v2: Workshop-level automated scientific discovery via agentic tree search, 2025

    Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist-v2: Workshop-level automated scientific discovery via agentic tree search, 2025. URLhttps://arxiv.org/abs/2504.08066

  28. [28]

    Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press

    John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering, 2024. URLhttps://arxiv.org/abs/2405.15793

  29. [29]

    React: Synergizing reasoning and acting in language models, 2023

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models, 2023. URL https: //arxiv.org/abs/2210.03629

  30. [30]

    Gödel agent: A self-referential agent framework for recursively self-improvement

    Xunjian Yin, Xinyi Wang, Liangming Pan, Li Lin, Xiaojun Wan, and William Yang Wang. Gödel agent: A self-referential agent framework for recursively self-improvement. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 27890–27913, 2025. doi: 10.18653/v1/2025.acl-long.1354. URL https: //...

  31. [31]

    Self-taught optimizer (stop): Recursively self-improving code generation, 2024

    Eric Zelikman, Eliana Lorch, Lester Mackey, and Adam Tauman Kalai. Self-taught optimizer (stop): Recursively self-improving code generation, 2024. URL https://arxiv.org/abs/ 2310.02304

  32. [32]

    The path of self-evolving large language models: Achieving data-efficient learning via intrinsic feedback.arXiv preprint arXiv:2510.02752, 2025

    Hangfan Zhang, Siyuan Xu, Zhimeng Guo, Huaisheng Zhu, Shicheng Liu, Xinrun Wang, Qiaosheng Zhang, Yang Chen, Peng Ye, Lei Bai, et al. The path of self-evolving large language models: Achieving data-efficient learning via intrinsic feedback.arXiv preprint arXiv:2510.02752, 2025

  33. [33]

    Darwin gödel machine: Open-ended evolution of self-improving agents, 2025

    Jenny Zhang, Shengran Hu, Cong Lu, Robert Lange, and Jeff Clune. Darwin gödel machine: Open-ended evolution of self-improving agents, 2025. URL https://arxiv.org/abs/2505. 22954

  34. [34]

    Agentic context engineering: Evolving contexts for self-improving language models, 2026

    Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, and Kunle Olukotun. Agentic context engineering: Evolving contexts for self-improving language models, 2026. URLhttps://arxiv.org/abs/2510.04618. ICLR 2026

  35. [35]

    Leveraging dual process theory in language agent framework for real-time simultaneous human-AI collaboration

    Shao Zhang, Xihuai Wang, Wenhao Zhang, Chaoran Li, Junru Song, Tingyu Li, Lin Qiu, Xuezhi Cao, Xunliang Cai, Wen Yao, Weinan Zhang, Xinbing Wang, and Ying Wen. Leveraging dual process theory in language agent framework for real-time simultaneous human-AI collaboration. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors...

  36. [36]

    Semaclaw: A step towards general-purpose personal ai agents through harness engineering, 2026

    Ningyan Zhu, Huacan Wang, Jie Zhou, Feiyu Chen, Shuo Zhang, Ge Chen, Chen Liu, Jiarou Wu, Wangyi Chen, Xiaofeng Mou, and Yi Xu. Semaclaw: A step towards general-purpose personal ai agents through harness engineering, 2026. URLhttps://arxiv.org/abs/2604.11548

  37. [37]

    Language agents as optimizable graphs, 2024

    Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, and Juergen Schmidhuber. Language agents as optimizable graphs, 2024. URL https://arxiv. org/abs/2402.16823. 17 A Additional Implementation Details A.1 Experimental details Model inference servicesMiniMax M2.5 and GLM-5 were accessed through hosted inference services, using Mi...