arxiv: 2605.22505 · v1 · pith:ZAKBXKPZnew · submitted 2026-05-21 · 💻 cs.AI

Towards Direct Evaluation of Harness Optimizers via Priority Ranking

Pith reviewed 2026-05-22 06:09 UTC · model grok-4.3

classification 💻 cs.AI
keywords harness optimization · priority ranking · optimizer evaluation · direct evaluation · agent improvement · Shor scenarios · multi-step optimization

The pith

Priority ranking directly evaluates harness optimizers by component impact and predicts their real-world success.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current evaluations of harness optimizers rely on final agent performance improvements, which obscures whether updates are informed or random. This paper proposes priority ranking as a direct method: optimizers rank harness components by how much updating them would help or hurt the agent. Using a new set of 182 human-verified scenarios called Shor, the authors demonstrate that better ranking on these scenarios corresponds to better performance in full multi-step optimization tasks. A sympathetic reader would care because this provides a cheaper, step-level way to test optimizers without running full expensive simulations or needing oracles. It helps clarify if harness optimization works through smart choices rather than trial and error.

Core claim

The central discovery is that an optimizer's ability to rank the priority of harness components for improvement correlates with its ability to successfully optimize agents over multiple steps, making priority ranking a reliable and low-cost predictor of optimization ability, supported by the Shor collection of scenarios.
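
The correlation claim can be made concrete with a small sketch. The Python below computes Pearson and Spearman coefficients between a per-optimizer ranking score and a per-optimizer gain in success rate. This summary does not state which coefficient the paper reports, and the function names and all numbers here are hypothetical, chosen only to illustrate the computation.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical data: one entry per optimizer (not taken from the paper).
ranking_score = [0.42, 0.55, 0.61, 0.70, 0.78]  # step-level ranking quality
sr_gain = [1.0, 2.5, 2.0, 4.0, 5.5]             # gain in target-agent success rate

print(round(pearson(ranking_score, sr_gain), 3))
print(round(spearman(ranking_score, sr_gain), 3))
```

With only a handful of optimizers, a rank-based coefficient is less sensitive to a single outlying run than Pearson, which is one reason to look at both.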

What carries the argument

Priority ranking, which quantifies an optimizer's step-level ability by requiring it to order harness components according to their potential effect on agent performance when updated.
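
As an illustration of how a step-level ranking could be scored, the sketch below compares an optimizer's predicted order against a reference order using top-1 agreement and Kendall's tau. The component names follow the Memory, Prompt, Tool, and Workflow axes named in the figure captions; the two metrics and the example orders are assumptions for illustration, not the paper's scoring procedure.

```python
from itertools import combinations


def kendall_tau(predicted, reference):
    """Kendall's tau-a between two tie-free rankings of the same items."""
    if len(set(predicted)) != len(predicted) or set(predicted) != set(reference):
        raise ValueError("rankings must be permutations of the same items")
    if len(reference) < 2:
        raise ValueError("need at least two items to compare orders")
    position = {item: i for i, item in enumerate(predicted)}
    concordant = discordant = 0
    # combinations() yields pairs (a, b) where a precedes b in the reference.
    for a, b in combinations(reference, 2):
        if position[a] < position[b]:
            concordant += 1
        else:
            discordant += 1
    n_pairs = len(reference) * (len(reference) - 1) // 2
    return (concordant - discordant) / n_pairs


def top1_hit(predicted, reference):
    """True if the optimizer put the reference's top component first."""
    return predicted[0] == reference[0]


# Hypothetical scenario: highest-priority component listed first.
reference = ["tool", "prompt", "memory", "workflow"]
predicted = ["tool", "memory", "prompt", "workflow"]

print(top1_hit(predicted, reference))
print(round(kendall_tau(predicted, reference), 3))
```

Top-1 agreement rewards only the single most important choice, while Kendall's tau credits partially correct orderings; which of these matters more depends on how many components an optimizer updates per step.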

If this is right

  • Optimizers can be assessed at individual update steps without full rollouts.
  • Informed decision-making can be distinguished from trial-and-error in optimization.
  • Optimizers can be benchmarked at scale across different domains using the Shor scenarios.
  • Ranking accuracy tracks end-to-end optimization gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future optimizer training could incorporate ranking tasks as an auxiliary objective to improve performance.
  • The approach might extend to evaluating other types of iterative AI agents beyond harness optimization.
  • Shor scenarios could serve as a standard benchmark for optimization-related AI tasks.

Load-bearing premise

The 182 human-verified scenarios in Shor capture a representative range of optimization challenges so that ranking skill on them indicates general optimization skill on new harnesses.

What would settle it

Running actual multi-step harness optimizations with high-ranking and low-ranking optimizers on new, unseen scenarios and finding no difference in agent improvement rates would falsify the correlation.
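
One way to analyze such an experiment is an exact permutation test on the difference in mean improvement between the two groups of optimizers. The sketch below is an assumption about how the comparison could be run, not a procedure from the paper; the function name and the example values are hypothetical.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(high, low):
    """One-sided exact permutation test for mean(high) > mean(low).

    Enumerates every way to relabel the pooled observations and returns the
    fraction of relabelings whose mean difference is at least the observed one.
    Only practical for small groups, since the number of splits grows quickly.
    """
    if not high or not low:
        raise ValueError("both groups must be non-empty")
    observed = mean(high) - mean(low)
    pooled = list(high) + list(low)
    n_high, n_low = len(high), len(low)
    total = sum(pooled)
    n_splits = n_extreme = 0
    for idx in combinations(range(len(pooled)), n_high):
        sum_high = sum(pooled[i] for i in idx)
        diff = sum_high / n_high - (total - sum_high) / n_low
        n_splits += 1
        if diff >= observed - 1e-12:  # tolerance for floating-point ties
            n_extreme += 1
    return n_extreme / n_splits


# Hypothetical success-rate gains on unseen harnesses (percentage points).
high_ranking_optimizers = [5.0, 6.0, 7.0]
low_ranking_optimizers = [1.0, 2.0, 3.0]

print(permutation_p_value(high_ranking_optimizers, low_ranking_optimizers))
```

A large p-value on unseen scenarios would be the "no difference" outcome described above; a small one would support the predictor claim for that sample.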

Figures

Figures reproduced from arXiv: 2605.22505 by Bogyung Jeong, Dongwook Choi, Geunha Jang, Jinyoung Yeo, Junhee Cho, Kai Tzu-iunn Ong, Minseok Kang, Minwoo Oh, Seungju Kim, Seungwon Lim, Sunghwan Kim, Taeyoon Kwon.

Figure 1: (Right) End-improvement observation vs. priority ranking. Our design quantifies optimizers’ ability cost- and time-effectively and directly, whereas existing evaluations require running the entire optimization process and offer limited insights; (Left) Examples of erroneous harness updates. … act as the optimizer (i.e., outer loop), iteratively updating the harness of a target agent (i.e., inner loop) based o…
Figure 2: Frequency of erroneous updates over the optimization process, regarding each harness component. Analysis I: About half of the optimization steps are considered detrimental. Studies have reported that harness optimizers make mistakes during the optimization process [15, 16]. To investigate the severity of this observation, we quantify it by examining real optimization trajectories (150 harnesses in total)…
Figure 3: Correlation between priority ranking and optimizer ability to improve target agents’ SR in harness optimization. The harness optimization is run for 10 iterations. We report the average results of 5 runs. We use mini-swe-agent (gpt-5-mini) as the target agent. … [16], who show that explicitly treating the optimizer’s own harness as an optimization target yields better optimization performance, proving the no…
Figure 4: Correlation ρ across time step intervals of base harnesses. Overall, this general trend of positive correlations justifies our ranking design: In actual harness optimization, the optimizer must understand the relationship between components’ current functional state and expected agent performance. Priority ranking tests the same ability, just in isolation. While this is being said, we note that priority …
Figure 5: Illustration of the dataset collecting process.
Figure 6: Overview of the Shor dataset, including key statistics (left) and the distribution of instances across timesteps (right). (a) Domain distribution (b) Top-1 component ratio (c) Flawed component ratio
Figure 7: Distribution statistics of Shor. Auxiliary Field I. The auxiliary field I comprises three groups: metadata, a human-annotated quality label, and by-products from the annotation process. Metadata records the contextual information of each instance, including the base harness performance r A(τT, xT), the domain and task identifier, and the hyperparameters used during annotation (ϵ, δ). Quality Label records …
Figure 8: Correlation between priority rankings and the optimizer’s ability to improve target agents’ SR in harness optimization, evaluated using the NDCG metric. G.2. Harness Optimization We evaluate the same optimizers used in Appendix G.1 on harness optimization across both in-domain and out-of-domain settings. For each domain, we randomly sample 5 base harnesses and use each as the initial harness of the target …
Figure 9: An example optimizer summary Si generated at step i = 2 of a GAIA trajectory, illustrating how the optimizer diagnoses failure modes and proposes targeted harness updates. G.4. Priority Ranking as an Actionable Insight In the experiment, we use SHOR-Flaw as the source of flawed harnesses. The oracle information given to the optimizer in the second setting consists of the target flawed code segment, the hum…
Figure 10: System and Instance prompts of ReCreate-Agent. System Prompt You are an expert evaluator for agent harnesses. Return only valid JSON. Instance Template You must compare exactly two harness candidates. Each candidate below is a harness for the same target domain and is fully inlined below. Domain Section {{ domain_section }} What to evaluate • The system prompt and workflow rules • Every custom tool implem…
Figure 11: System and Instance prompts of the harness evaluator used in Section 3.
Figure 12: Meta-Harness prompt template (1/2): header, constraints, and evolution axes. Meta-Harness Prompt Template for Harness Collection (2/2) Domain Context Task type: {{TASK_DESCRIPTION}} Agent actions (fixed DSL, do not add or remove): {{ACTION_DSL}} Evaluation method: {{EVALUATION_MODE}} Fixed API modules (do NOT modify): {{FIXED_API_MODULES}} Observed baseline failure modes: {{OBSERVED_FAILURE_MODES}} Memor…
Figure 13: Meta-Harness prompt template (2/2): domain context, memory rules, and workflow.
Figure 14: Annotator prompt template with domain- and component-specific placeholders.
Figure 15: Optimizer prompt template for the Memory axis.
Figure 16: Optimizer prompt template for the Prompt axis.
Figure 17: Optimizer prompt template for the Tool axis.
Figure 18: Optimizer prompt template for the Workflow axis.
Figure 19: Harness Optimizer Prompt.
Figure 20: Priority Ranking Prompt.
Figure 21: Flaw annotation tool interface.
Figure 22: Error recovery annotation tool interface.
Original abstract

Harness optimization enables automated agent creation by having an optimizer agent iteratively update the harness of target agents. Despite its success, current studies evaluate optimizers solely by observing target agents' performance gains. This indirect end-improvement evaluation neglects optimizers' actions at intermediate steps, which are often erroneous and hinder agent performance. Therefore, it is unclear whether harness optimization is driven by optimizers' informed update actions or simply trial-and-error. This necessitates direct evaluation of harness optimizers. However, evaluating harness optimizers directly is non-trivial and costly due to the lack of oracle harnesses. To address this, we present a simple, low-cost design to directly evaluate them, namely priority ranking. By asking harness optimizers to rank components (e.g., tools) in a given harness by their potential to improve/hinder agent performance when updated, our design quantifies optimizer ability at the step level without expensive rollouts or manual examination. More importantly, optimizers' ranking performance correlates with their ability to improve agents in actual multi-step harness optimization, establishing priority ranking as a reliable predictor of optimization ability. Priority ranking is enabled by Shor, a collection of 182 human-verified optimization scenarios spanning across domains, designs, and time stages. Codes and data can be found at https://github.com/k59118/Harness_Optimizer_Evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper proposes priority ranking as a direct, low-cost method to evaluate harness optimizers by requiring them to rank harness components (e.g., tools) according to their potential to improve or hinder target agent performance. This addresses limitations of indirect evaluation via end-performance gains, which overlook intermediate erroneous actions. The approach is instantiated on the Shor dataset of 182 human-verified optimization scenarios spanning domains, designs, and stages; the central empirical claim is that optimizers' ranking accuracy on these scenarios correlates with their success in multi-step harness optimization, positioning priority ranking as a reliable predictor.

Significance. If the reported correlation is robust and generalizes beyond the Shor distribution, the work supplies a practical direct-evaluation primitive that avoids expensive rollouts while quantifying step-level optimizer behavior. The public release of code and data at the cited GitHub repository is a clear strength for reproducibility and follow-on research in automated agent construction.

major comments (3)
  1. [Abstract] Abstract and evaluation section: the assertion that ranking performance 'correlates with their ability to improve agents in actual multi-step harness optimization' is load-bearing for the predictor claim, yet the abstract supplies no correlation coefficient, p-value, sample size, or control for scenario-specific confounds. Explicit statistical reporting and a description of how the correlation was computed are required.
  2. [Shor dataset description] Shor dataset and experimental setup: both ranking and multi-step optimization results are obtained on (subsets of) the same 182 human-verified scenarios. This design leaves open whether the observed correlation reflects genuine optimizer skill or patterns that human verifiers preferentially selected; a hold-out evaluation on unseen harnesses or domains is needed to substantiate the 'reliable predictor' conclusion.
  3. [Priority ranking design] Methods for priority ranking: the procedure for selecting which components to rank and the precise scoring rubric used to obtain ground-truth rankings from human verification must be detailed to confirm that the metric is independent of prior optimizer performance and not circular.
minor comments (3)
  1. [Dataset] Add a short table summarizing the distribution of domains, design types, and temporal stages across the 182 scenarios to demonstrate coverage.
  2. [Introduction] Clarify in the introduction whether 'harness' refers to a specific prompt-engineering construct or a more general agent scaffolding to aid readers outside the immediate sub-area.
  3. [Figures] Ensure any figures showing example rankings include axis labels, legend, and error bars where statistical variation is reported.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments on our manuscript. We address each major comment below and indicate the revisions made to strengthen the paper.

Point-by-point responses
  1. Referee: [Abstract] Abstract and evaluation section: the assertion that ranking performance 'correlates with their ability to improve agents in actual multi-step harness optimization' is load-bearing for the predictor claim, yet the abstract supplies no correlation coefficient, p-value, sample size, or control for scenario-specific confounds. Explicit statistical reporting and a description of how the correlation was computed are required.

    Authors: We agree that the abstract and evaluation section would benefit from explicit statistical details to support the central claim. In the revised manuscript we have added the correlation coefficient, associated p-value, sample size, and a description of the computation method (Pearson correlation between ranking accuracy and multi-step success rate). We have also clarified the controls applied, including stratification by domain and optimization stage to account for scenario-specific confounds. revision: yes

  2. Referee: [Shor dataset description] Shor dataset and experimental setup: both ranking and multi-step optimization results are obtained on (subsets of) the same 182 human-verified scenarios. This design leaves open whether the observed correlation reflects genuine optimizer skill or patterns that human verifiers preferentially selected; a hold-out evaluation on unseen harnesses or domains is needed to substantiate the 'reliable predictor' conclusion.

    Authors: We acknowledge the potential concern about using the same scenarios. However, the ground-truth rankings were produced by human verifiers who had no access to optimizer outputs or performance data, and the 182 scenarios were selected to cover diverse domains, designs, and stages. This construction reduces the chance that the correlation arises merely from verifier-selected patterns. We therefore maintain that the reported correlation provides evidence for priority ranking as a predictor, though we recognize the value of future hold-out studies on entirely new harnesses. revision: no

  3. Referee: [Priority ranking design] Methods for priority ranking: the procedure for selecting which components to rank and the precise scoring rubric used to obtain ground-truth rankings from human verification must be detailed to confirm that the metric is independent of prior optimizer performance and not circular.

    Authors: We agree that greater methodological detail is warranted. In the revised manuscript we have expanded the Methods section to specify that all harness components (tools, prompts, and other elements) are ranked, and that human verifiers assign integer impact scores from -2 to +2 reflecting expected performance change before deriving the ground-truth order. This scoring occurs independently of any optimizer and prior to optimizer evaluation, ensuring the metric is non-circular. revision: yes
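
To illustrate how integer impact scores could be turned into a reference order and a graded metric, the sketch below sorts components by score and computes NDCG, the metric named in the Figure 8 caption. The -2 to +2 scale comes from the simulated response above; shifting scores to non-negative gains, the function names, and the example scores are assumptions made for this illustration only.

```python
from math import log2


def reference_order(impact):
    """Components sorted from most helpful to most harmful expected impact."""
    return sorted(impact, key=lambda c: impact[c], reverse=True)


def ndcg(predicted, impact, lowest_score=-2):
    """NDCG of a predicted order, using (score - lowest_score) as the gain.

    The shift keeps gains non-negative, so a score of -2 contributes 0.
    """
    if len(set(predicted)) != len(predicted) or set(predicted) != set(impact):
        raise ValueError("predicted must be a permutation of the scored components")
    gains = {c: s - lowest_score for c, s in impact.items()}

    def dcg(order):
        # Position 0 is discounted by log2(2) = 1, position 1 by log2(3), etc.
        return sum(gains[c] / log2(pos + 2) for pos, c in enumerate(order))

    ideal = dcg(reference_order(impact))
    if ideal == 0:
        return 0.0  # every component has the lowest score; nothing to rank
    return dcg(predicted) / ideal


# Hypothetical human-assigned impact scores for one scenario.
impact = {"tool": 2, "prompt": 1, "memory": -1, "workflow": -2}
predicted = ["prompt", "tool", "memory", "workflow"]

print(reference_order(impact))
print(round(ndcg(predicted, impact), 3))
```

Because the discount shrinks with position, swapping the top two components costs more than swapping the bottom two, which matches the intuition that the first update choice matters most.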

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained

Full rationale

The paper defines priority ranking as an independent, low-cost direct evaluation protocol that asks optimizers to rank harness components by improvement potential, then separately measures correlation against multi-step optimization gains on the Shor collection of 182 human-verified scenarios. No equations, fitted parameters, or self-citations are shown that reduce the claimed correlation to a definitional identity or to the same fitted values used as input. Shor functions as an external benchmark rather than a self-referential loop, and the central claim retains independent empirical content even if both metrics are computed on subsets of the same scenarios. The derivation chain therefore does not collapse by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that human-verified scenarios capture meaningful optimization steps and that ranking by potential impact is a valid proxy without additional fitted parameters or new entities beyond the dataset itself.

axioms (1)
  • domain assumption: Human verification ensures the 182 scenarios accurately represent real harness optimization challenges across domains and stages.
    Invoked to justify using Shor as the basis for priority ranking tests.
invented entities (1)
  • Shor dataset (no independent evidence)
    purpose: Collection of 182 human-verified optimization scenarios to enable priority ranking evaluations.
    New benchmark introduced to support the direct evaluation method.

pith-pipeline@v0.9.0 · 5815 in / 1231 out tokens · 31612 ms · 2026-05-22T06:09:06.582566+00:00 · methodology

Reference graph

Works this paper leans on

116 extracted references · 116 canonical work pages · 15 internal anchors

  1. [1]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations, 2022

  2. [2]

    Coffee-gym: An environment for evaluating and improving natural language feedback on erroneous code

    Hyungjoo Chae, Taeyoon Kwon, Seungjun Moon, Yongho Song, Dongjin Kang, Kai Tzu-iunn Ong, Beong-woo Kwak, Seonghyeon Bae, Seung-won Hwang, and Jinyoung Yeo. Coffee-gym: An environment for evaluating and improving natural language feedback on erroneous code. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference o...

  3. [3]

    doi: 10.18653/v1/2024.emnlp-main.1254

    Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.1254. URL https://aclanthology.org/2024.emnlp-main.1254/

  4. [4]

    CompileAgent: Automated real-world repo-level compilation with tool-integrated LLM-based agent system

    Li Hu, Guoqiang Chen, Xiuwei Shang, Shaoyin Cheng, Benlong Wu, Gangyang Li, Xu Zhu, Weiming Zhang, and Nenghai Yu. CompileAgent: Automated real-world repo-level compilation with tool-integrated LLM-based agent system. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Asso...

  5. [5]

    URL https://aclanthology.org/2025.acl-long.103/

  6. [6]

    Towards lifelong dialogue agents via timeline-based memory management

    Kai Tzu-iunn Ong, Namyoung Kim, Minju Gwak, Hyungjoo Chae, Taeyoon Kwon, Yohan Jo, Seung-won Hwang, Dongha Lee, and Jinyoung Yeo. Towards lifelong dialogue agents via timeline-based memory management. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Comput...

  7. [8]

    Swe-bench mobile: Can large language model agents develop industry-level mobile applications?

    Muxin Tian, Zhe Wang, Blair Yang, Zhenwei Tang, Kunlun Zhu, Honghua Dong, Hanchen Li, Xinni Xie, Guangjing Wang, and Jiaxuan You. Swe-bench mobile: Can large language model agents develop industry-level mobile applications? arXiv preprint arXiv:2602.09540, 2026

  8. [9]

    Embodied agents meet personalization: Investigating challenges and solutions through the lens of memory utilization

    Taeyoon Kwon, Dongwook Choi, Hyojun Kim, Sunghwan Kim, Seungjun Moon, Beong-woo Kwak, Kuan-Hao Huang, and Jinyoung Yeo. Embodied agents meet personalization: Investigating challenges and solutions through the lens of memory utilization. arXiv preprint arXiv:2505.16348, 2025

  9. [10]

    Harness design for long-running application development

    Anthropic. Harness design for long-running application development, Mar 2026. URL https://www.anthropic.com/engineering/harness-design-long-running-apps

  10. [11]

    Web agents with world models: Learning and leveraging environment dynamics in web navigation

    Hyungjoo Chae, Namyoung Kim, Kai Tzu-iunn Ong, Minju Gwak, Gwanwoo Song, Jihoon Kim, Sunghwan Kim, Dongha Lee, and Jinyoung Yeo. Web agents with world models: Learning and leveraging environment dynamics in web navigation. In The Thirteenth International Conference on Learning Representations, 2025

  11. [12]

    ReCreate: Reasoning and Creating Domain Agents Driven by Experience

    Zhezheng Hao, Hong Wang, Jian Luo, Jianqing Zhang, Yuyan Zhou, Qiang Lin, Can Wang, Hande Dong, and Jiawei Chen. Recreate: Reasoning and creating domain agents driven by experience. arXiv preprint arXiv:2601.11100, 2026.

  12. [13]

    Automated Design of Agentic Systems

    Shengran Hu, Cong Lu, and Jeff Clune. Automated design of agentic systems. arXiv preprint arXiv:2408.08435, 2024

  13. [14]

    AFlow: Automating Agentic Workflow Generation

    Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, et al. Aflow: Automating agentic workflow generation. arXiv preprint arXiv:2410.10762, 2024

  14. [15]

    GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

    Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, et al. Gepa: Reflective prompt evolution can outperform reinforcement learning. arXiv preprint arXiv:2507.19457, 2025

  15. [16]

    Dynamic cheatsheet: Test-time learning with adaptive memory

    Mirac Suzgun, Mert Yuksekgonul, Federico Bianchi, Dan Jurafsky, and James Zou. Dynamic cheatsheet: Test-time learning with adaptive memory. In Vera Demberg, Kentaro Inui, and Lluís Marquez, editors, Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7080–7106, Rabat, Morocco...

  16. [17]

    Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models

    Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, et al. Agentic context engineering: Evolving contexts for self-improving language models. arXiv preprint arXiv:2510.04618, 2025

  17. [18]

    VeRO: An Evaluation Harness for Agents to Optimize Agents

    Varun Ursekar, Apaar Shanker, Veronica Chatrath, Sam Denton, et al. Vero: An evaluation harness for agents to optimize agents. arXiv preprint arXiv:2602.22480, 2026

  18. [19]

    Hyperagents

    Jenny Zhang, Bingchen Zhao, Wannan Yang, Jakob Foerster, Jeff Clune, Minqi Jiang, Sam Devlin, and Tatiana Shavrina. Hyperagents. arXiv preprint arXiv:2603.19461, 2026

  19. [20]

    Meta-Harness: End-to-End Optimization of Model Harnesses

    Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn. Meta-harness: End-to-end optimization of model harnesses. arXiv preprint arXiv:2603.28052, 2026

  20. [21]

    Optimizing generative ai by backpropagating language model feedback

    Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Pan Lu, Zhi Huang, Carlos Guestrin, and James Zou. Optimizing generative ai by backpropagating language model feedback. Nature, 639 (8055):609–616, 2025

  21. [22]

    Large language models as optimizers

    Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. In The Twelfth International Conference on Learning Representations, 2023

  22. [23]

    AlphaEvolve: A coding agent for scientific and algorithmic discovery

    Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. Alphaevolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131, 2025

  23. [24]

    ThetaEvolve: Test-time Learning on Open Problems

    Yiping Wang, Shao-Rong Su, Zhiyuan Zeng, Eva Xu, Liliang Ren, Xinyu Yang, Zeyi Huang, Xuehai He, Luyao Ma, Baolin Peng, et al. Thetaevolve: Test-time learning on open problems. arXiv preprint arXiv:2511.23473, 2025

  24. [25]

    Zhenrui Yue, Kartikeya Upasani, Xianjun Yang, Suyu Ge, Shaoliang Nie, Yuning Mao, Zhe Liu, and Dong Wang. Dr. zero: Self-evolving search agents without training data. arXiv preprint arXiv:2601.07055, 2026.

  25. [26]

    Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

  26. [27]

    Expel: Llm agentsareexperientiallearners

    Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. Expel: Llm agentsareexperientiallearners. InProceedingsoftheAAAIConferenceonArtificialIntelligence, volume38, pages 19632–19642, 2024

  27. [28]

    PRINCIPLES: Synthetic strategy memory for proactive dialogue agents

    NamyoungKim, KaiTzu-iunnOng, YeonjunHwang, MinseokKang, IiseoJihn, GayoungKim, MinjuKim, and Jinyoung Yeo. PRINCIPLES: Synthetic strategy memory for proactive dialogue agents. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Findings of the Association for Computational Linguistics: EMNLP 2025, pages 21329–21368, ...

  28. [29]

    In: Christodoulopoulos, C., Chakraborty, T., Rose, C., Peng, V

    Association for Computational Linguistics. ISBN 979-8-89176-335-7. doi: 10.18653/v1/2025. findings-emnlp.1164. URLhttps://aclanthology.org/2025.findings-emnlp.1164/

  29. [30]

    CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery

    Ao Qu, Han Zheng, Zijian Zhou, Yihao Yan, Yihong Tang, Shao Yong Ong, Fenglu Hong, Kaichen Zhou, Chonghe Jiang, Minwei Kong, et al. Coral: Towards autonomous multi-agent evolution for open-ended discovery.arXiv preprint arXiv:2604.01658, 2026

  30. [31]

    PhD thesis, Maastricht University, 2010

    Guillaume Chaslot.Monte-carlo tree search. PhD thesis, Maastricht University, 2010

  31. [32]

    Scoreflow: Mastering llm agent workflows via score-based preference optimization.arXiv preprint arXiv:2502.04306, 2025

    Yinjie Wang, Ling Yang, Guohao Li, Mengdi Wang, and Bryon Aragam. Scoreflow: Mastering llm agent workflows via score-based preference optimization.arXiv preprint arXiv:2502.04306, 2025

  32. [33]

    Robustflow: Towards robust agentic workflow generation.arXiv preprint arXiv:2509.21834, 2025

    Shengxiang Xu, Jiayi Zhang, Shimin Di, Yuyu Luo, Liang Yao, Hanmo Liu, Jia Zhu, Fan Liu, and Min- Ling Zhang. Robustflow: Towards robust agentic workflow generation.arXiv preprint arXiv:2509.21834, 2025

  33. [34]

    Flowreasoner: Reinforcing query-level meta-agents.arXiv preprint arXiv:2504.15257, 2025

    Hongcheng Gao, Yue Liu, Yufei He, Longxu Dou, Chao Du, Zhijie Deng, Bryan Hooi, Min Lin, and Tianyu Pang. Flowreasoner: Reinforcing query-level meta-agents.arXiv preprint arXiv:2504.15257, 2025

  34. [35]

    Workflow-r1: Group sub-sequence policy optimization for multi-turn workflow construction.arXiv preprint arXiv:2602.01202, 2026

    Mingze Kong, Zikun Qu, Zhongquan Zhou, Pengyu Liang, Xiang Li, Zhiwei Shang, Zhi Hong, Kaiyu Huang, Zhiyong Wang, and Zhongxiang Dai. Workflow-r1: Group sub-sequence policy optimization for multi-turn workflow construction.arXiv preprint arXiv:2602.01202, 2026

  35. [36]

    From static templates to dynamic runtime graphs: A survey of workflow optimization for llm agents.arXiv preprint arXiv:2603.22386, 2026

    Ling Yue, Kushal Raj Bhandari, Ching-Yun Ko, Dhaval Patel, Shuxin Lin, Nianjun Zhou, Jianxi Gao, Pin-Yu Chen, and Shaowu Pan. From static templates to dynamic runtime graphs: A survey of workflow optimization for llm agents.arXiv preprint arXiv:2603.22386, 2026

  36. [37]

    Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

  37. [38]

    Claude code by anthropic, 2026

    Anthropic. Claude code by anthropic, 2026. URL https://www.anthropic.com/product/ claude-code

  38. [39]

    Spider 2.0: Evaluating language models on real-world enterprise text-to-sql workflows. arXiv preprint arXiv:2411.07763, 2024

    Fangyu Lei, Jixuan Chen, Yuxiao Ye, Ruisheng Cao, Dongchan Shin, Hongjin Su, Zhaoqing Suo, Hongcheng Gao, Wenjing Hu, Pengcheng Yin, et al. Spider 2.0: Evaluating language models on real-world enterprise text-to-sql workflows. arXiv preprint arXiv:2411.07763, 2024

  39. [40]

    Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan. τ²-bench: Evaluating conversational agents in a dual-control environment. arXiv preprint arXiv:2506.07982, 2025

  40. [41]

    Gaia: a benchmark for general ai assistants

    Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. In The Twelfth International Conference on Learning Representations, 2023

  41. [42]

    Introducing swe-bench verified, 2024

    OpenAI. Introducing swe-bench verified, 2024. URL https://openai.com/index/introducing-swe-bench-verified/

  42. [43]

    Appworld: A controllable world of apps and people for benchmarking interactive coding agents

    Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. Appworld: A controllable world of apps and people for benchmarking interactive coding agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), p...

  43. [44]

    Gpqa: A graduate-level google-proof q&a benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024

  44. [45]

    Introducing codex, 2025

    OpenAI. Introducing codex, 2025. URL https://openai.com/index/introducing-codex/

  45. [46]

    Gpt-5.3-codex, 2026

    OpenAI. Gpt-5.3-codex, 2026. URL https://openai.com/index/introducing-gpt-5-3-codex/

  46. [47]

    Introducing claude sonnet 4.6, 2026

    Anthropic. Introducing claude sonnet 4.6, 2026. URL https://www.anthropic.com/news/claude-sonnet-4-6

  47. [48]

    Gemini cli, 2026

    Gemini CLI. Gemini cli, 2026. URL https://geminicli.com/

  48. [49]

    Gemini 3.1 pro, 2026

    Google Deepmind. Gemini 3.1 pro, 2026. URL https://deepmind.google/models/gemini/pro/

  49. [50]

    mini-SWE-agent: The 100 line AI agent that solves GitHub issues. https://github.com/SWE-agent/mini-swe-agent, 2025

    Kilian Lieret et al. mini-SWE-agent: The 100 line AI agent that solves GitHub issues. https://github.com/SWE-agent/mini-swe-agent, 2025. GitHub repository

  50. [51]

    OpenHands: An Open Platform for AI Software Developers as Generalist Agents

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. Openhands: An open platform for ai software developers as generalist agents. arXiv preprint arXiv:2407.16741, 2024

  51. [52]

    AutoAgent: A Fully-Automated and Zero-Code Framework for LLM Agents, 2025

    Jiabin Tang, Tianyu Fan, and Chao Huang. AutoAgent: A Fully-Automated and Zero-Code Framework for LLM Agents, 2025. URL https://arxiv.org/abs/2502.05957

  52. [53]

    Your agent may misevolve: Emergent risks in self-evolving LLM agents

    Shuai Shao, Qihan Ren, Chen Qian, Boyi Wei, Dadi Guo, Jingyi Yang, Xinhao Song, Linfeng Zhang, Weinan Zhang, Dongrui Liu, et al. Your agent may misevolve: Emergent risks in self-evolving llm agents. arXiv preprint arXiv:2509.26354, 2025

  53. [54]

    Swe-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems, 37:50528–50652, 2024

    John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems, 37:50528–50652, 2024

  54. [55]

    Executable code actions elicit better llm agents

    Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. Executable code actions elicit better llm agents. In Forty-first International Conference on Machine Learning, 2024

  55. [56]

    System card: Claude Haiku 4.5

    Anthropic. System card: Claude Haiku 4.5. Technical report, Anthropic, October 2025. URL https://anthropic.com/claude-haiku-4-5-system-card

  56. [57]

    System card: Claude Sonnet 4.6

    Anthropic. System card: Claude Sonnet 4.6. Technical report, Anthropic, February 2026. URL https://www-cdn.anthropic.com/bbd8ef16d70b7a1665f14f306ee88b53f686aa75.pdf

  57. [58]

    Introducing GPT-4.1 in the api, April 2025

    OpenAI. Introducing GPT-4.1 in the api, April 2025. URL https://openai.com/index/gpt-4-1/

  58. [59]

    OpenAI GPT-5 system card, August 2025

    OpenAI. OpenAI GPT-5 system card, August 2025. URL https://openai.com/index/gpt-5-system-card/

  59. [60]

    Update to GPT-5 system card: GPT-5.2, December 2025

    OpenAI. Update to GPT-5 system card: GPT-5.2, December 2025. URL https://openai.com/index/gpt-5-system-card-update-gpt-5-2/

  60. [61]

    GPT-5.5 system card, April 2026

    OpenAI. GPT-5.5 system card, April 2026. URL https://openai.com/index/gpt-5-5-system-card/

  61. [62]

    Gemini 3 Flash model card

    Google DeepMind. Gemini 3 Flash model card. Technical report, Google DeepMind, December 2025. URL https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Flash-Model-Card.pdf

  63. [64]

    Gemini 3 Pro model card

    Google DeepMind. Gemini 3 Pro model card. Technical report, Google DeepMind, November 2025. URL https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf

  64. [65]

    DeepSeek-V4: Towards highly efficient million-token context intelligence, April 2026

    DeepSeek-AI. DeepSeek-V4: Towards highly efficient million-token context intelligence, April 2026. URL https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro

  65. [66]

    Qwen3.5: Accelerating productivity with native multimodal agents, February 2026

    Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5

  66. [67]

    Qwen3.6-Plus: Towards real world agents, April 2026

    Qwen Team. Qwen3.6-Plus: Towards real world agents, April 2026. URL https://qwen.ai/blog?id=qwen3.6

  67. [68]

    Kimi K2.6: Scaling agent orchestration with multimodal integration, April 2026

    Moonshot AI. Kimi K2.6: Scaling agent orchestration with multimodal integration, April 2026. URL https://www.kimi.com/blog/kimi-k2-6

  68. [69]

    GLM-5: from Vibe Coding to Agentic Engineering

    GLM-5-Team. Glm-5: from vibe coding to agentic engineering, 2026. URL https://arxiv.org/abs/2602.15763

  69. [70]

    τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

    Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045, 2024

  70. [71]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023

    A. Appendix Contents • Limitations: Appendix B • Details on Analyses in Section 3: App...

  71. [72]

    Valid MEMORY upgrades: • Subclass MemoryRetriever with keyword/topic overlap scoring

    Memory path layout: shared_memory_path_for_run(run_dir) → <dataset_output>/memory/<agent>/<model>/memory.json. Valid MEMORY upgrades: • Subclass MemoryRetriever with keyword/topic overlap scoring. • FailureMemory - store only was_correct: false records for lesson retrieval. • EntityResolutionMemory - cache canonical entity → URL mappings encountered before. • Redesign ...

  72. [73]

    Read base: note current retriever type, _store_memory_result shape

  73. [74]

    Read trajectory — focus on cases where prior-task memory should have helped but didn’t

  74. [75]

    Minimal supporting edits only

    Edit memory layer. Minimal supporting edits only

  75. [76]

    component

    Write meta: {"component": "memory", "coding_agent": "...", "base": "...", "hypothesis": "...", ...} ## REMINDERS Exactly ONE upgraded harness. Primary axis MEMORY. No task-specific hints. Figure 15. Optimizer prompt template for the Memory axis. Optimizer Prompt Template (skills.md) (Prompt) ##...

  76. [77]

    Prompt assembly in _build_agent_config(work_dir, problem) with extra_template_vars

  77. [78]

    re-check sources agree before Submitting

    Memory/tools rendering inside the system prompt ({memory_context}, {tool_list}). Valid PROMPT upgrades: • Stricter final-answer format rule (short exact strings, no explanations). • Decompose-question-first scaffold (multi-hop decomposition before search). • Verify-before-commit rule (“re-check sources agree before Submitting”). • “If conflicting sources, fetc...

  78. [79]

    Read base: note current SYSTEM_TEMPLATE, INSTANCE_TEMPLATE

  79. [80]

    Read trajectory: focus on was_correct: false + answer-format mismatches / wrong entity picks

  80. [81]

    Minimal supporting edits only

    Edit prompt strings. Minimal supporting edits only
