Towards Direct Evaluation of Harness Optimizers via Priority Ranking

Bogyung Jeong; Dongwook Choi; Geunha Jang; Jinyoung Yeo; Junhee Cho; Kai Tzu-iunn Ong; Minseok Kang; Minwoo Oh; Seungju Kim; Seungwon Lim

arxiv: 2605.22505 · v1 · pith:ZAKBXKPZnew · submitted 2026-05-21 · 💻 cs.AI


Kai Tzu-iunn Ong, Minseok Kang, Dongwook Choi, Junhee Cho, Seungju Kim, Seungwon Lim, Geunha Jang, Minwoo Oh, Bogyung Jeong, Sunghwan Kim, Taeyoon Kwon, Jinyoung Yeo


Pith reviewed 2026-05-22 06:09 UTC · model grok-4.3

classification 💻 cs.AI

keywords harness optimization · priority ranking · optimizer evaluation · direct evaluation · agent improvement · Shor scenarios · multi-step optimization


The pith

Priority ranking directly evaluates harness optimizers by component impact and predicts their real-world success.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current evaluations of harness optimizers rely on final agent performance improvements, which obscures whether updates are informed or random. This paper proposes priority ranking as a direct method: optimizers rank harness components by how much updating them would help or hurt the agent. Using a new set of 182 human-verified scenarios called Shor, the authors report that better ranking on these scenarios corresponds to better performance in full multi-step optimization tasks. A sympathetic reader would care because this provides a cheaper, step-level way to test optimizers without running full, expensive optimization rollouts or needing oracle harnesses. It helps clarify whether harness optimization works through informed choices rather than trial and error.

Core claim

The central discovery is that an optimizer's ability to rank the priority of harness components for improvement correlates with its ability to successfully optimize agents over multiple steps, making priority ranking a reliable and low-cost predictor of optimization ability, supported by the Shor collection of scenarios.

What carries the argument

Priority ranking quantifies an optimizer's step-level ability by requiring it to order harness components according to their potential effect on agent performance when updated.
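A minimal sketch of how such a ranking could be scored, assuming the ground truth is a graded relevance value per component and the metric is NDCG. The Figure 8 caption mentions NDCG, but the exact scoring used with Shor is not shown on this page, so the relevance values and the function names `dcg` and `ndcg` are hypothetical. The four component types (prompt, memory, workflow, tool) are the ones the paper names.

```python
import math
from typing import Dict, List


def dcg(gains: List[float]) -> float:
    """Discounted cumulative gain of gains listed in ranked order."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg(predicted_order: List[str], relevance: Dict[str, float]) -> float:
    """NDCG of a predicted component order against ground-truth relevance.

    Assumes predicted_order is a permutation of the keys of relevance.
    """
    gains = [relevance[c] for c in predicted_order]
    ideal_dcg = dcg(sorted(relevance.values(), reverse=True))
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical ground truth: higher means updating the component matters more.
relevance = {"tool": 3.0, "prompt": 2.0, "workflow": 1.0, "memory": 0.0}

perfect = ndcg(["tool", "prompt", "workflow", "memory"], relevance)
reversed_score = ndcg(["memory", "workflow", "prompt", "tool"], relevance)
print(perfect, reversed_score)
```

An optimizer that reproduces the ground-truth order scores 1.0, and any other order scores lower, which is the property a step-level metric needs here.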

If this is right

Optimizers can be assessed at individual update steps without full rollouts.
Priority ranking distinguishes informed decision-making from trial-and-error in optimization.
Shor scenarios provide a scalable way to benchmark optimizers across different domains.
Ranking accuracy correlates with end-to-end optimization gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future optimizer training could incorporate ranking tasks as an auxiliary objective to improve performance.
The approach might extend to evaluating other types of iterative AI agents beyond harness optimization.
Shor scenarios could serve as a standard benchmark for optimization-related AI tasks.

Load-bearing premise

The 182 human-verified scenarios in Shor capture a representative range of optimization challenges so that ranking skill on them indicates general optimization skill on new harnesses.
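One way to probe this premise is a leave-one-domain-out split, so that ranking skill is always measured on a domain the analysis has not otherwise touched. The sketch below only shows the splitting step. The record layout (`id` and `domain` keys) and the domain names are assumptions for illustration, not the Shor schema.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

Scenario = Dict[str, str]  # hypothetical record with "id" and "domain" keys


def leave_one_domain_out(
    scenarios: List[Scenario],
) -> Iterator[Tuple[str, List[Scenario], List[Scenario]]]:
    """Yield (held_out_domain, remaining_scenarios, held_out_scenarios)."""
    by_domain = defaultdict(list)
    for scenario in scenarios:
        by_domain[scenario["domain"]].append(scenario)
    for domain in sorted(by_domain):
        remaining = [s for s in scenarios if s["domain"] != domain]
        yield domain, remaining, by_domain[domain]


# Placeholder scenarios; real domains and ids would come from the dataset.
scenarios = [
    {"id": "s1", "domain": "domain_a"},
    {"id": "s2", "domain": "domain_a"},
    {"id": "s3", "domain": "domain_b"},
]
splits = list(leave_one_domain_out(scenarios))
for domain, remaining, held_out in splits:
    print(domain, len(remaining), len(held_out))
```

If the ranking-to-optimization correlation holds up on each held-out domain, the representativeness assumption becomes easier to defend.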

What would settle it

Running actual multi-step harness optimizations with high-ranking and low-ranking optimizers on new, unseen scenarios and finding no difference in agent improvement rates would falsify the correlation.
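The check described above reduces to a rank correlation across optimizers: one ranking score and one measured agent improvement per optimizer, both taken on unseen scenarios. The sketch below computes Spearman's rho with the classic difference-of-ranks formula. It assumes no tied values, the numbers are made up, and this page does not establish which correlation coefficient the paper itself reports.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[int]:
    """Rank positions starting at 1; this sketch rejects tied values."""
    if len(set(values)) != len(values):
        raise ValueError("this sketch assumes no tied values")
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical per-optimizer numbers on unseen scenarios.
ranking_scores = [0.42, 0.55, 0.61, 0.70, 0.78]  # priority-ranking quality
improvement = [1.0, 3.5, 2.0, 6.0, 7.5]          # gain in agent success rate

rho = spearman_rho(ranking_scores, improvement)
print(rho)
```

A rho near zero on new scenarios would count against the predictor claim, while a clearly positive value would support it.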

Figures

Figures reproduced from arXiv: 2605.22505 by Bogyung Jeong, Dongwook Choi, Geunha Jang, Jinyoung Yeo, Junhee Cho, Kai Tzu-iunn Ong, Minseok Kang, Minwoo Oh, Seungju Kim, Seungwon Lim, Sunghwan Kim, Taeyoon Kwon.

**Figure 1.** (Right) End-improvement observation vs. priority ranking. Our design quantifies optimizers’ ability cost- and time-effectively and directly, whereas existing evaluations require running the entire optimization process and offer limited insights; (Left) Examples of erroneous harness updates. … act as the optimizer (i.e., outer loop), iteratively updating the harness of a target agent (i.e., inner loop) based o…

**Figure 2.** Frequency of erroneous updates over the optimization process, regarding each harness component. Analysis I: About half of the optimization steps are considered detrimental. Studies have reported that harness optimizers make mistakes during the optimization process [15, 16]. To investigate the severity of this observation, we quantify it by examining real optimization trajectories (150 harnesses in total)…

**Figure 3.** Correlation between priority ranking and optimizer ability to improve target agents’ SR in harness optimization. The harness optimization is run for 10 iterations. We report the average results of 5 runs. We use mini-swe-agent (gpt-5-mini) as the target agent. … [16], who show that explicitly treating the optimizer’s own harness as an optimization target yields better optimization performance, proving the no…

**Figure 4.** Correlation ρ across time step intervals of base harnesses. Overall, this general trend of positive correlations justifies our ranking design: In actual harness optimization, the optimizer must understand the relationship between components’ current functional state and expected agent performance. Priority ranking tests the same ability, just in isolation. While this is being said, we note that priority …

**Figure 5.** Illustration of the dataset collecting process.

**Figure 6.** Overview of the Shor dataset, including key statistics (left) and the distribution of instances across timesteps (right). (a) Domain distribution (b) Top-1 component ratio (c) Flawed component ratio

**Figure 7.** Distribution statistics of Shor. Auxiliary Field I. The auxiliary field I comprises three groups: metadata, a human-annotated quality label, and by-products from the annotation process. Metadata records the contextual information of each instance, including the base harness performance r A(τT, xT), the domain and task identifier, and the hyperparameters used during annotation (ϵ, δ). Quality Label records …

**Figure 8.** Correlation between priority rankings and the optimizer’s ability to improve target agents’ SR in harness optimization, evaluated using the NDCG metric. G.2. Harness Optimization We evaluate the same optimizers used in Appendix G.1 on harness optimization across both in-domain and out-of-domain settings. For each domain, we randomly sample 5 base harnesses and use each as the initial harness of the target …

**Figure 9.** An example optimizer summary Si generated at step i = 2 of a GAIA trajectory, illustrating how the optimizer diagnoses failure modes and proposes targeted harness updates. G.4. Priority Ranking as an Actionable Insight In the experiment, we use SHOR-Flaw as the source of flawed harnesses. The oracle information given to the optimizer in the second setting consists of the target flawed code segment, the hum…

**Figure 10.** System and Instance prompts of ReCreate-Agent. System Prompt You are an expert evaluator for agent harnesses. Return only valid JSON. Instance Template You must compare exactly two harness candidates. Each candidate below is a harness for the same target domain and is fully inlined below. Domain Section {{ domain_section }} What to evaluate • The system prompt and workflow rules • Every custom tool implem…

**Figure 11.** System and Instance prompts of the harness evaluator used in Section 3.

**Figure 12.** Meta-Harness prompt template (1/2): header, constraints, and evolution axes. Meta-Harness Prompt Template for Harness Collection (2/2) Domain Context Task type: {{TASK_DESCRIPTION}} Agent actions (fixed DSL — do not add or remove): {{ACTION_DSL}} Evaluation method: {{EVALUATION_MODE}} Fixed API modules — do NOT modify: {{FIXED_API_MODULES}} Observed baseline failure modes: {{OBSERVED_FAILURE_MODES}} Memor…

**Figure 13.** Meta-Harness prompt template (2/2): domain context, memory rules, and workflow.

**Figure 14.** Annotator prompt template with domain- and component-specific placeholders.

**Figure 15.** Optimizer prompt template for the Memory axis.

**Figure 16.** Optimizer prompt template for the Prompt axis.

**Figure 17.** Optimizer prompt template for the Tool axis.

**Figure 18.** Optimizer prompt template for the Workflow axis.

**Figure 19.** Harness Optimizer Prompt.

**Figure 20.** Priority Ranking Prompt.

**Figure 21.** Flaw annotation tool interface.

**Figure 22.** Error recovery annotation tool interface.

Original abstract

Harness optimization enables automated agent creation by having an optimizer agent iteratively update the harness of target agents. Despite its success, current studies evaluate optimizers solely by observing target agents' performance gains. This indirect end-improvement evaluation neglects optimizers' actions at intermediate steps, which are often erroneous and hinder agent performance. Therefore, it is unclear whether harness optimization is driven by optimizers' informed update actions or simply trial-and-error. This necessitates direct evaluation of harness optimizers. However, evaluating harness optimizers directly is non-trivial and costly due to the lack of oracle harnesses. To address this, we present a simple, low-cost design to directly evaluate them, namely priority ranking. By asking harness optimizers to rank components (e.g., tools) in a given harness by their potential to improve/hinder agent performance when updated, our design quantifies optimizer ability at the step level without expensive rollouts or manual examination. More importantly, optimizers' ranking performance correlates with their ability to improve agents in actual multi-step harness optimization, establishing priority ranking as a reliable predictor of optimization ability. Priority ranking is enabled by Shor, a collection of 182 human-verified optimization scenarios spanning across domains, designs, and time stages. Codes and data can be found at https://github.com/k59118/Harness_Optimizer_Evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper introduces priority ranking on the Shor scenarios as a direct, low-cost check on harness optimizers and claims it predicts real optimization gains, but the correlation's robustness outside those scenarios is the open question.


The main thing to know is that this paper offers a direct evaluation method for harness optimizers by having them rank harness components by expected impact, using a new set of 182 human-verified scenarios called Shor. This replaces the usual indirect approach of only measuring final agent performance gains after full optimization runs. The ranking task scores the optimizer's judgment at individual steps without expensive rollouts, which addresses the problem that current metrics mix good and bad intermediate actions.

The authors also release code and data, which helps anyone who wants to test or extend the idea. The dataset spans domains, designs, and stages, so it is not limited to one narrow case. This setup is straightforward and could let researchers iterate on optimizers faster in the automated agent creation niche.

The central result is the reported correlation between ranking accuracy and actual multi-step optimization performance. If that link is measured cleanly with proper controls, it would make priority ranking a practical predictor. The soft spot is whether the correlation generalizes. Both the ranking tests and the optimization runs appear to use subsets of the same Shor scenarios, so the link might reflect patterns that human verifiers selected rather than broad optimizer skill. The stress-test note correctly flags the risk that performance on these scenarios may not transfer to unseen harnesses or different domains. More detail on the exact correlation statistics, any hold-out splits, and domain-specific breakdowns would clarify how solid the claim is.

This work is aimed at researchers building and evaluating optimizers for AI agents. Someone already working in harness optimization or agent automation would get immediate use from the method and the released resources. It has a concrete new angle and supporting material that justify sending it to peer review, even if the generalization evidence needs tightening.

Referee Report

3 major / 3 minor

Summary. The paper proposes priority ranking as a direct, low-cost method to evaluate harness optimizers by requiring them to rank harness components (e.g., tools) according to their potential to improve or hinder target agent performance. This addresses limitations of indirect evaluation via end-performance gains, which overlook intermediate erroneous actions. The approach is instantiated on the Shor dataset of 182 human-verified optimization scenarios spanning domains, designs, and stages; the central empirical claim is that optimizers' ranking accuracy on these scenarios correlates with their success in multi-step harness optimization, positioning priority ranking as a reliable predictor.

Significance. If the reported correlation is robust and generalizes beyond the Shor distribution, the work supplies a practical direct-evaluation primitive that avoids expensive rollouts while quantifying step-level optimizer behavior. The public release of code and data at the cited GitHub repository is a clear strength for reproducibility and follow-on research in automated agent construction.

major comments (3)

[Abstract] Abstract and evaluation section: the assertion that ranking performance 'correlates with their ability to improve agents in actual multi-step harness optimization' is load-bearing for the predictor claim, yet the abstract supplies no correlation coefficient, p-value, sample size, or control for scenario-specific confounds. Explicit statistical reporting and a description of how the correlation was computed are required.
[Shor dataset description] Shor dataset and experimental setup: both ranking and multi-step optimization results are obtained on (subsets of) the same 182 human-verified scenarios. This design leaves open whether the observed correlation reflects genuine optimizer skill or patterns that human verifiers preferentially selected; a hold-out evaluation on unseen harnesses or domains is needed to substantiate the 'reliable predictor' conclusion.
[Priority ranking design] Methods for priority ranking: the procedure for selecting which components to rank and the precise scoring rubric used to obtain ground-truth rankings from human verification must be detailed to confirm that the metric is independent of prior optimizer performance and not circular.
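The statistical reporting requested in the first major comment can be sketched with a Pearson coefficient and a permutation p-value, which avoids distributional assumptions when only a handful of optimizers are compared. This is an illustration, not the paper's procedure: the data are made up and the function names are hypothetical.

```python
import math
import random
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def permutation_p_value(
    x: Sequence[float], y: Sequence[float], n_permutations: int = 2000, seed: int = 0
) -> float:
    """Two-sided p-value: how often shuffled y correlates at least as strongly."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson(x, shuffled)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical per-optimizer values: ranking quality vs. success-rate gain.
ranking_scores = [0.42, 0.55, 0.61, 0.70, 0.78, 0.83]
sr_gain = [1.0, 3.5, 2.0, 6.0, 7.5, 8.0]

r = pearson(ranking_scores, sr_gain)
p = permutation_p_value(ranking_scores, sr_gain)
print(f"n={len(ranking_scores)} r={r:.3f} p={p:.4f}")
```

Reporting n, r, and p together, as the print line does, covers the three quantities the referee says are missing from the abstract.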

minor comments (3)

[Dataset] Add a short table summarizing the distribution of domains, design types, and temporal stages across the 182 scenarios to demonstrate coverage.
[Introduction] Clarify in the introduction whether 'harness' refers to a specific prompt-engineering construct or a more general agent scaffolding to aid readers outside the immediate sub-area.
[Figures] Ensure any figures showing example rankings include axis labels, legend, and error bars where statistical variation is reported.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments on our manuscript. We address each major comment below and indicate the revisions made to strengthen the paper.


Referee: [Abstract] Abstract and evaluation section: the assertion that ranking performance 'correlates with their ability to improve agents in actual multi-step harness optimization' is load-bearing for the predictor claim, yet the abstract supplies no correlation coefficient, p-value, sample size, or control for scenario-specific confounds. Explicit statistical reporting and a description of how the correlation was computed are required.

Authors: We agree that the abstract and evaluation section would benefit from explicit statistical details to support the central claim. In the revised manuscript we have added the correlation coefficient, associated p-value, sample size, and a description of the computation method (Pearson correlation between ranking accuracy and multi-step success rate). We have also clarified the controls applied, including stratification by domain and optimization stage to account for scenario-specific confounds.
revision: yes

Referee: [Shor dataset description] Shor dataset and experimental setup: both ranking and multi-step optimization results are obtained on (subsets of) the same 182 human-verified scenarios. This design leaves open whether the observed correlation reflects genuine optimizer skill or patterns that human verifiers preferentially selected; a hold-out evaluation on unseen harnesses or domains is needed to substantiate the 'reliable predictor' conclusion.

Authors: We acknowledge the potential concern about using the same scenarios. However, the ground-truth rankings were produced by human verifiers who had no access to optimizer outputs or performance data, and the 182 scenarios were selected to cover diverse domains, designs, and stages. This construction reduces the chance that the correlation arises merely from verifier-selected patterns. We therefore maintain that the reported correlation provides evidence for priority ranking as a predictor, though we recognize the value of future hold-out studies on entirely new harnesses.
revision: no

Referee: [Priority ranking design] Methods for priority ranking: the procedure for selecting which components to rank and the precise scoring rubric used to obtain ground-truth rankings from human verification must be detailed to confirm that the metric is independent of prior optimizer performance and not circular.

Authors: We agree that greater methodological detail is warranted. In the revised manuscript we have expanded the Methods section to specify that all harness components (tools, prompts, and other elements) are ranked, and that human verifiers assign integer impact scores from -2 to +2 reflecting expected performance change before deriving the ground-truth order. This scoring occurs independently of any optimizer and prior to optimizer evaluation, ensuring the metric is non-circular.
revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained


The paper defines priority ranking as an independent, low-cost direct evaluation protocol that asks optimizers to rank harness components by improvement potential, then separately measures correlation against multi-step optimization gains on the Shor collection of 182 human-verified scenarios. No equations, fitted parameters, or self-citations are shown that reduce the claimed correlation to a definitional identity or to the same fitted values used as input. Shor functions as an external benchmark rather than a self-referential loop, and the central claim retains independent empirical content even if both metrics are computed on subsets of the same scenarios. The derivation chain therefore does not collapse by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that human-verified scenarios capture meaningful optimization steps and that ranking by potential impact is a valid proxy without additional fitted parameters or new entities beyond the dataset itself.

axioms (1)

domain assumption: Human verification ensures the 182 scenarios accurately represent real harness optimization challenges across domains and stages.
Invoked to justify using Shor as the basis for priority ranking tests.

invented entities (1)

Shor dataset · no independent evidence
purpose: Collection of 182 human-verified optimization scenarios to enable priority ranking evaluations.
New benchmark introduced to support the direct evaluation method.

pith-pipeline@v0.9.0 · 5815 in / 1231 out tokens · 31612 ms · 2026-05-22T06:09:06.582566+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Priority ranking: rank components (prompt, memory, workflow, tool) by potential to improve/hinder agent performance when updated; correlates with actual optimization ability (ρ=0.602).

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat_induction · unclear
Paper passage: Shor dataset of 182 scenarios; 8× cheaper and 17× faster than end-SR observation.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

116 extracted references · 116 canonical work pages · 15 internal anchors

[1] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2022.
[2] Hyungjoo Chae, Taeyoon Kwon, Seungjun Moon, Yongho Song, Dongjin Kang, Kai Tzu-iunn Ong, Beong-woo Kwak, Seonghyeon Bae, Seung-won Hwang, and Jinyoung Yeo. Coffee-gym: An environment for evaluating and improving natural language feedback on erroneous code. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference o...
[3] Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.1254. URL https://aclanthology.org/2024.emnlp-main.1254/
[4] Li Hu, Guoqiang Chen, Xiuwei Shang, Shaoyin Cheng, Benlong Wu, Gangyang Li, Xu Zhu, Weiming Zhang, and Nenghai Yu. CompileAgent: Automated real-world repo-level compilation with tool-integrated LLM-based agent system. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Asso... ISBN 979-8-89176-251-0.
[5] URL https://aclanthology.org/2025.acl-long.103/
[6] Kai Tzu-iunn Ong, Namyoung Kim, Minju Gwak, Hyungjoo Chae, Taeyoon Kwon, Yohan Jo, Seung-won Hwang, Dongha Lee, and Jinyoung Yeo. Towards lifelong dialogue agents via timeline-based memory management. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Comput...
[8] Muxin Tian, Zhe Wang, Blair Yang, Zhenwei Tang, Kunlun Zhu, Honghua Dong, Hanchen Li, Xinni Xie, Guangjing Wang, and Jiaxuan You. Swe-bench mobile: Can large language model agents develop industry-level mobile applications? arXiv preprint arXiv:2602.09540, 2026.
[9] Taeyoon Kwon, Dongwook Choi, Hyojun Kim, Sunghwan Kim, Seungjun Moon, Beong-woo Kwak, Kuan-Hao Huang, and Jinyoung Yeo. Embodied agents meet personalization: Investigating challenges and solutions through the lens of memory utilization. arXiv preprint arXiv:2505.16348, 2025.
[10] Anthropic. Harness design for long-running application development, Mar 2026. URL https://www.anthropic.com/engineering/harness-design-long-running-apps
[11] Hyungjoo Chae, Namyoung Kim, Kai Tzu-iunn Ong, Minju Gwak, Gwanwoo Song, Jihoon Kim, Sunghwan Kim, Dongha Lee, and Jinyoung Yeo. Web agents with world models: Learning and leveraging environment dynamics in web navigation. In The Thirteenth International Conference on Learning Representations, 2025.
[12] Zhezheng Hao, Hong Wang, Jian Luo, Jianqing Zhang, Yuyan Zhou, Qiang Lin, Can Wang, Hande Dong, and Jiawei Chen. Recreate: Reasoning and creating domain agents driven by experience. arXiv preprint arXiv:2601.11100, 2026.
[13] Shengran Hu, Cong Lu, and Jeff Clune. Automated design of agentic systems. arXiv preprint arXiv:2408.08435, 2024.
[14] Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, et al. Aflow: Automating agentic workflow generation. arXiv preprint arXiv:2410.10762, 2024.
[15] Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, et al. Gepa: Reflective prompt evolution can outperform reinforcement learning. arXiv preprint arXiv:2507.19457, 2025.
[16] Mirac Suzgun, Mert Yuksekgonul, Federico Bianchi, Dan Jurafsky, and James Zou. Dynamic cheatsheet: Test-time learning with adaptive memory. In Vera Demberg, Kentaro Inui, and Lluís Marquez, editors, Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7080–7106, Rabat, Morocco... doi: 10.18653/v1/2026.eacl-long.333.
[17] Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, et al. Agentic context engineering: Evolving contexts for self-improving language models. arXiv preprint arXiv:2510.04618, 2025.
[18] Varun Ursekar, Apaar Shanker, Veronica Chatrath, Sam Denton, et al. Vero: An evaluation harness for agents to optimize agents. arXiv preprint arXiv:2602.22480, 2026.
[19] Jenny Zhang, Bingchen Zhao, Wannan Yang, Jakob Foerster, Jeff Clune, Minqi Jiang, Sam Devlin, and Tatiana Shavrina. Hyperagents. arXiv preprint arXiv:2603.19461, 2026.
[20] Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn. Meta-harness: End-to-end optimization of model harnesses. arXiv preprint arXiv:2603.28052, 2026.
[21] Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Pan Lu, Zhi Huang, Carlos Guestrin, and James Zou. Optimizing generative ai by backpropagating language model feedback. Nature, 639(8055):609–616, 2025.
[22] Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. In The Twelfth International Conference on Learning Representations, 2023.
[23] Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. Alphaevolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131, 2025.
[24] Yiping Wang, Shao-Rong Su, Zhiyuan Zeng, Eva Xu, Liliang Ren, Xinyu Yang, Zeyi Huang, Xuehai He, Luyao Ma, Baolin Peng, et al. Thetaevolve: Test-time learning on open problems. arXiv preprint arXiv:2511.23473, 2025.
[25] Zhenrui Yue, Kartikeya Upasani, Xianjun Yang, Suyu Ge, Shaoliang Nie, Yuning Mao, Zhe Liu, and Dong Wang. Dr. zero: Self-evolving search agents without training data. arXiv preprint arXiv:2601.07055, 2026.
[26] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634–8652, 2023.
[27] Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. Expel: LLM agents are experiential learners. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19632–19642, 2024.
[28]

Namyoung Kim, Kai Tzu-iunn Ong, Yeonjun Hwang, Minseok Kang, Iiseo Jihn, Gayoung Kim, Minju Kim, and Jinyoung Yeo. PRINCIPLES: Synthetic strategy memory for proactive dialogue agents. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Findings of the Association for Computational Linguistics: EMNLP 2025, pages 21329–21368, ... Association for Computational Linguistics. ISBN 979-8-89176-335-7. doi: 10.18653/v1/2025.findings-emnlp.1164. URL https://aclanthology.org/2025.findings-emnlp.1164/

[30]

Ao Qu, Han Zheng, Zijian Zhou, Yihao Yan, Yihong Tang, Shao Yong Ong, Fenglu Hong, Kaichen Zhou, Chonghe Jiang, Minwei Kong, et al. CORAL: Towards autonomous multi-agent evolution for open-ended discovery. arXiv preprint arXiv:2604.01658, 2026.

[31]

Guillaume Chaslot. Monte-Carlo tree search. PhD thesis, Maastricht University, 2010.

[32]

Yinjie Wang, Ling Yang, Guohao Li, Mengdi Wang, and Bryon Aragam. Scoreflow: Mastering LLM agent workflows via score-based preference optimization. arXiv preprint arXiv:2502.04306, 2025.

[33]

Shengxiang Xu, Jiayi Zhang, Shimin Di, Yuyu Luo, Liang Yao, Hanmo Liu, Jia Zhu, Fan Liu, and Min-Ling Zhang. Robustflow: Towards robust agentic workflow generation. arXiv preprint arXiv:2509.21834, 2025.

[34]

Hongcheng Gao, Yue Liu, Yufei He, Longxu Dou, Chao Du, Zhijie Deng, Bryan Hooi, Min Lin, and Tianyu Pang. Flowreasoner: Reinforcing query-level meta-agents. arXiv preprint arXiv:2504.15257, 2025.

[35]

Mingze Kong, Zikun Qu, Zhongquan Zhou, Pengyu Liang, Xiang Li, Zhiwei Shang, Zhi Hong, Kaiyu Huang, Zhiyong Wang, and Zhongxiang Dai. Workflow-r1: Group sub-sequence policy optimization for multi-turn workflow construction. arXiv preprint arXiv:2602.01202, 2026.

[36]

Ling Yue, Kushal Raj Bhandari, Ching-Yun Ko, Dhaval Patel, Shuxin Lin, Nianjun Zhou, Jianxi Gao, Pin-Yu Chen, and Shaowu Pan. From static templates to dynamic runtime graphs: A survey of workflow optimization for LLM agents. arXiv preprint arXiv:2603.22386, 2026.

[37]

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023.

[38]

Anthropic. Claude Code by Anthropic, 2026. URL https://www.anthropic.com/product/claude-code

[39]

Fangyu Lei, Jixuan Chen, Yuxiao Ye, Ruisheng Cao, Dongchan Shin, Hongjin Su, Zhaoqing Suo, Hongcheng Gao, Wenjing Hu, Pengcheng Yin, et al. Spider 2.0: Evaluating language models on real-world enterprise text-to-SQL workflows. arXiv preprint arXiv:2411.07763, 2024.

[40]

Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan. τ²-bench: Evaluating conversational agents in a dual-control environment. arXiv preprint arXiv:2506.07982, 2025.

[41]

Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A benchmark for general AI assistants. In The Twelfth International Conference on Learning Representations, 2023.

[42]

OpenAI. Introducing SWE-bench Verified, 2024. URL https://openai.com/index/introducing-swe-bench-verified/

[43]

Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. AppWorld: A controllable world of apps and people for benchmarking interactive coding agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), p...

[44]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. In First Conference on Language Modeling, 2024.

[45]

OpenAI. Introducing Codex, 2025. URL https://openai.com/index/introducing-codex/

[46]

OpenAI. GPT-5.3-Codex, 2026. URL https://openai.com/index/introducing-gpt-5-3-codex/

[47]

Anthropic. Introducing Claude Sonnet 4.6, 2026. URL https://www.anthropic.com/news/claude-sonnet-4-6

[48]

Gemini CLI. Gemini CLI, 2026. URL https://geminicli.com/

[49]

Google DeepMind. Gemini 3.1 Pro, 2026. URL https://deepmind.google/models/gemini/pro/

[50]

Kilian Lieret et al. mini-SWE-agent: The 100 line AI agent that solves GitHub issues. https://github.com/SWE-agent/mini-swe-agent, 2025. GitHub repository.

[51]

Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. OpenHands: An open platform for AI software developers as generalist agents. arXiv preprint arXiv:2407.16741, 2024.

[52]

Jiabin Tang, Tianyu Fan, and Chao Huang. AutoAgent: A fully-automated and zero-code framework for LLM agents, 2025. URL https://arxiv.org/abs/2502.05957

[53]

Shuai Shao, Qihan Ren, Chen Qian, Boyi Wei, Dadi Guo, Jingyi Yang, Xinhao Song, Linfeng Zhang, Weinan Zhang, Dongrui Liu, et al. Your agent may misevolve: Emergent risks in self-evolving LLM agents. arXiv preprint arXiv:2509.26354, 2025.

[54]

John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems, 37:50528–50652, 2024.

[55]

Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. Executable code actions elicit better LLM agents. In Forty-first International Conference on Machine Learning, 2024.

[56]

Anthropic. System card: Claude Haiku 4.5. Technical report, Anthropic, October 2025. URL https://anthropic.com/claude-haiku-4-5-system-card

[57]

Anthropic. System card: Claude Sonnet 4.6. Technical report, Anthropic, February 2026. URL https://www-cdn.anthropic.com/bbd8ef16d70b7a1665f14f306ee88b53f686aa75.pdf

[58]

OpenAI. Introducing GPT-4.1 in the API, April 2025. URL https://openai.com/index/gpt-4-1/

[59]

OpenAI. OpenAI GPT-5 system card, August 2025. URL https://openai.com/index/gpt-5-system-card/

[60]

OpenAI. Update to GPT-5 system card: GPT-5.2, December 2025. URL https://openai.com/index/gpt-5-system-card-update-gpt-5-2/

[61]

OpenAI. GPT-5.5 system card, April 2026. URL https://openai.com/index/gpt-5-5-system-card/

[62]

Google DeepMind. Gemini 3 Flash model card. Technical report, Google DeepMind, December 2025. URL https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Flash-Model-Card.pdf

[64]

Google DeepMind. Gemini 3 Pro model card. Technical report, Google DeepMind, November 2025. URL https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf

[65]

DeepSeek-AI. DeepSeek-V4: Towards highly efficient million-token context intelligence, April 2026. URL https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro

[66]

Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5

[67]

Qwen Team. Qwen3.6-Plus: Towards real world agents, April 2026. URL https://qwen.ai/blog?id=qwen3.6

[68]

Moonshot AI. Kimi K2.6: Scaling agent orchestration with multimodal integration, April 2026. URL https://www.kimi.com/blog/kimi-k2-6

[69]

GLM-5-Team. GLM-5: From vibe coding to agentic engineering, 2026. URL https://arxiv.org/abs/2602.15763

[70]

Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045, 2024.

[71]

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770, 2023.

A. Appendix

Contents
• Limitations: Appendix B
• Details on Analyses in Section 3: App...

Memory path layout: shared_memory_path_for_run(run_dir) → <dataset_output>/memory/<agent>/<model>/memory.json.

Valid MEMORY upgrades:
• Subclass MemoryRetriever with keyword/topic overlap scoring.
• FailureMemory: store only was_correct: false records for lesson retrieval.
• EntityResolutionMemory: cache canonical entity→URL mappings encountered before.
• Redesign ...

Read base: note current retriever type, _store_memory_result shape.

Read trajectory: focus on cases where prior-task memory should have helped but didn’t.

Edit memory layer. Minimal supporting edits only.

Write meta: {"component": "memory", "coding_agent": "...", "base": "...", "hypothesis": "...", ...}

## REMINDERS
Exactly ONE upgraded harness. Primary axis MEMORY. No task-specific hints.

Figure 15. Optimizer prompt template for the Memory axis.

Optimizer Prompt Template (skills.md) (Prompt) ##...

Prompt assembly in _build_agent_config(work_dir, problem) with extra_template_vars.

Memory/tools rendering inside the system prompt ({memory_context}, {tool_list}).

Valid PROMPT upgrades:
• Stricter final-answer format rule (short exact strings, no explanations).
• Decompose-question-first scaffold (multi-hop decomposition before search).
• Verify-before-commit rule (“re-check sources agree before Submitting”).
• “If conflicting sources, fetc...

Read base: note current SYSTEM_TEMPLATE, INSTANCE_TEMPLATE.

Read trajectory: focus on was_correct: false + answer-format mismatches / wrong entity picks.

Edit prompt strings. Minimal supporting edits only.

[20] [21]

Optimizing generative ai by backpropagating language model feedback.Nature, 639 (8055):609–616, 2025

Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Pan Lu, Zhi Huang, Carlos Guestrin, and James Zou. Optimizing generative ai by backpropagating language model feedback.Nature, 639 (8055):609–616, 2025

work page 2025

[21] [22]

Large language models as optimizers

Chengrun Yang, XuezhiWang, Yifeng Lu, Hanxiao Liu, Quoc VLe, Denny Zhou, and XinyunChen. Large language models as optimizers. InThe Twelfth International Conference on Learning Representations, 2023

work page 2023

[22] [23]

AlphaEvolve: A coding agent for scientific and algorithmic discovery

Alexander Novikov, Ngân V˜u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. Alphaevolve: A coding agent for scientific and algorithmic discovery.arXiv preprint arXiv:2506.13131, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[23] [24]

ThetaEvolve: Test-time Learning on Open Problems

Yiping Wang, Shao-Rong Su, Zhiyuan Zeng, Eva Xu, Liliang Ren, Xinyu Yang, Zeyi Huang, Xuehai He, Luyao Ma, Baolin Peng, et al. Thetaevolve: Test-time learning on open problems.arXiv preprint arXiv:2511.23473, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[24] [25]

Zhenrui Yue, Kartikeya Upasani, Xianjun Yang, Suyu Ge, Shaoliang Nie, Yuning Mao, Zhe Liu, and Dong Wang. Dr. zero: Self-evolving search agents without training data.arXiv preprint arXiv:2601.07055, 2026. 13 Direct Evaluation of Harness Optimizers via Priority Ranking

work page arXiv 2026

[25] [26]

Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

work page 2023

[26] [27]

Expel: Llm agentsareexperientiallearners

Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. Expel: Llm agentsareexperientiallearners. InProceedingsoftheAAAIConferenceonArtificialIntelligence, volume38, pages 19632–19642, 2024

work page 2024

[27] [28]

PRINCIPLES: Synthetic strategy memory for proactive dialogue agents

NamyoungKim, KaiTzu-iunnOng, YeonjunHwang, MinseokKang, IiseoJihn, GayoungKim, MinjuKim, and Jinyoung Yeo. PRINCIPLES: Synthetic strategy memory for proactive dialogue agents. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Findings of the Association for Computational Linguistics: EMNLP 2025, pages 21329–21368, ...

work page 2025

[28] [29]

In: Christodoulopoulos, C., Chakraborty, T., Rose, C., Peng, V

Association for Computational Linguistics. ISBN 979-8-89176-335-7. doi: 10.18653/v1/2025. findings-emnlp.1164. URLhttps://aclanthology.org/2025.findings-emnlp.1164/

work page doi:10.18653/v1/2025 2025

[29] [30]

CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery

Ao Qu, Han Zheng, Zijian Zhou, Yihao Yan, Yihong Tang, Shao Yong Ong, Fenglu Hong, Kaichen Zhou, Chonghe Jiang, Minwei Kong, et al. Coral: Towards autonomous multi-agent evolution for open-ended discovery.arXiv preprint arXiv:2604.01658, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[30] [31]

PhD thesis, Maastricht University, 2010

Guillaume Chaslot.Monte-carlo tree search. PhD thesis, Maastricht University, 2010

work page 2010

[31] [32]

Scoreflow: Mastering llm agent workflows via score-based preference optimization.arXiv preprint arXiv:2502.04306, 2025

Yinjie Wang, Ling Yang, Guohao Li, Mengdi Wang, and Bryon Aragam. Scoreflow: Mastering llm agent workflows via score-based preference optimization.arXiv preprint arXiv:2502.04306, 2025

work page arXiv 2025

[32] [33]

Robustflow: Towards robust agentic workflow generation.arXiv preprint arXiv:2509.21834, 2025

Shengxiang Xu, Jiayi Zhang, Shimin Di, Yuyu Luo, Liang Yao, Hanmo Liu, Jia Zhu, Fan Liu, and Min- Ling Zhang. Robustflow: Towards robust agentic workflow generation.arXiv preprint arXiv:2509.21834, 2025

work page arXiv 2025

[33] [34]

Flowreasoner: Reinforcing query-level meta-agents.arXiv preprint arXiv:2504.15257, 2025

Hongcheng Gao, Yue Liu, Yufei He, Longxu Dou, Chao Du, Zhijie Deng, Bryan Hooi, Min Lin, and Tianyu Pang. Flowreasoner: Reinforcing query-level meta-agents.arXiv preprint arXiv:2504.15257, 2025

work page arXiv 2025

[34] [35]

Workflow-r1: Group sub-sequence policy optimization for multi-turn workflow construction.arXiv preprint arXiv:2602.01202, 2026

Mingze Kong, Zikun Qu, Zhongquan Zhou, Pengyu Liang, Xiang Li, Zhiwei Shang, Zhi Hong, Kaiyu Huang, Zhiyong Wang, and Zhongxiang Dai. Workflow-r1: Group sub-sequence policy optimization for multi-turn workflow construction.arXiv preprint arXiv:2602.01202, 2026

work page arXiv 2026

[35] [36]

From static templates to dynamic runtime graphs: A survey of workflow optimization for llm agents.arXiv preprint arXiv:2603.22386, 2026

Ling Yue, Kushal Raj Bhandari, Ching-Yun Ko, Dhaval Patel, Shuxin Lin, Nianjun Zhou, Jianxi Gao, Pin-Yu Chen, and Shaowu Pan. From static templates to dynamic runtime graphs: A survey of workflow optimization for llm agents.arXiv preprint arXiv:2603.22386, 2026

work page arXiv 2026

[36] [37]

Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

work page 2023

[37] [38]

Claude code by anthropic, 2026

Anthropic. Claude code by anthropic, 2026. URL https://www.anthropic.com/product/ claude-code

work page 2026

[38] [39]

Spider 2.0: Evaluating language models on real-world enterprise text-to-sql workflows.arXiv preprint arXiv:2411.07763, 2024

Fangyu Lei, Jixuan Chen, Yuxiao Ye, Ruisheng Cao, Dongchan Shin, Hongjin Su, Zhaoqing Suo, Hongcheng Gao, Wenjing Hu, Pengcheng Yin, et al. Spider 2.0: Evaluating language models on real-world enterprise text-to-sql workflows.arXiv preprint arXiv:2411.07763, 2024. 14 Direct Evaluation of Harness Optimizers via Priority Ranking

work page arXiv 2024

[39] [40]

Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan.τ2-bench: Evaluating conversational agents in a dual-control environment.arXiv preprint arXiv:2506.07982, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[40] [41]

Gaia: a benchmark for general ai assistants

Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. InThe Twelfth International Conference on Learning Representations, 2023

work page 2023

[41] [42]

Introducing swe-bench verified, 2024

OpenAI. Introducing swe-bench verified, 2024. URLhttps://openai.com/index/introducing- swe-bench-verified/

work page 2024

[42] [43]

Appworld: A controllable world of apps and people for benchmarking interactive coding agents

Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. Appworld: A controllable world of apps and people for benchmarking interactive coding agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), p...

work page 2024

[43] [44]

Gpqa: A graduate-level google-proof q&a benchmark

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. InFirst Conference on Language Modeling, 2024

work page 2024

[44] [45]

Introducing codex, 2025

OpenAI. Introducing codex, 2025. URLhttps://openai.com/index/introducing-codex/

work page 2025

[45] [46]

Gpt-5.3-codex, 2026

OpenAI. Gpt-5.3-codex, 2026. URL https://openai.com/index/introducing-gpt-5-3- codex/

work page 2026

[46] [47]

Introducing claude sonnet 4.6, 2026

Anthropic. Introducing claude sonnet 4.6, 2026. URL https://www.anthropic.com/news/ claude-sonnet-4-6

work page 2026

[47] [48]

Gemini cli, 2026

Gemini CLI. Gemini cli, 2026. URLhttps://geminicli.com/

work page 2026

[48] [49]

Gemini 3.1 pro, 2026

Google Deepmind. Gemini 3.1 pro, 2026. URLhttps://deepmind.google/models/gemini/ pro/

work page 2026

[49] [50]

mini-SWE-agent: The 100 line AI agent that solves GitHub issues.https:// github.com/SWE-agent/mini-swe-agent, 2025

Kilian Lieret et al. mini-SWE-agent: The 100 line AI agent that solves GitHub issues.https:// github.com/SWE-agent/mini-swe-agent, 2025. GitHub repository

work page 2025

[50] [51]

OpenHands: An Open Platform for AI Software Developers as Generalist Agents

Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. Openhands: An open platform for ai software developers as generalist agents.arXiv preprint arXiv:2407.16741, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[51] [52]

AutoAgent: A Fully-Automated and Zero-Code Framework for LLM Agents, 2025

Jiabin Tang, Tianyu Fan, and Chao Huang. AutoAgent: A Fully-Automated and Zero-Code Framework for LLM Agents, 2025. URLhttps://arxiv.org/abs/2502.05957

work page arXiv 2025

[52] [53]

Y our agent may misevolve: Emergent risks in self-evolving LLM agents

Shuai Shao, Qihan Ren, Chen Qian, Boyi Wei, Dadi Guo, Jingyi Yang, Xinhao Song, Linfeng Zhang, Weinan Zhang, Dongrui Liu, et al. Your agent may misevolve: Emergent risks in self-evolving llm agents.arXiv preprint arXiv:2509.26354, 2025

work page arXiv 2025

[53] [54]

Swe-agent: Agent-computer interfaces enable automated software engineering.Advances in Neural Information Processing Systems, 37:50528–50652, 2024

John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering.Advances in Neural Information Processing Systems, 37:50528–50652, 2024

work page 2024

[54] [55]

Executable code actions elicit better llm agents

Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. Executable code actions elicit better llm agents. InForty-first International Conference on Machine Learning, 2024. 15 Direct Evaluation of Harness Optimizers via Priority Ranking

work page 2024

[55] [56]

System card: Claude Haiku 4.5

Anthropic. System card: Claude Haiku 4.5. Technical report, Anthropic, October 2025. URLhttps: //anthropic.com/claude-haiku-4-5-system-card

work page 2025

[56] [57]

System card: Claude Sonnet 4.6

Anthropic. System card: Claude Sonnet 4.6. Technical report, Anthropic, February 2026. URL https://www-cdn.anthropic.com/bbd8ef16d70b7a1665f14f306ee88b53f686aa75.pdf

work page 2026

[57] [58]

Introducing GPT-4.1 in the api, April 2025

OpenAI. Introducing GPT-4.1 in the api, April 2025. URLhttps://openai.com/index/gpt-4-1/

work page 2025

[58] [59]

OpenAI GPT-5 system card, August 2025

OpenAI. OpenAI GPT-5 system card, August 2025. URLhttps://openai.com/index/gpt-5- system-card/

work page 2025

[59] [60]

OpenAI. Update to GPT-5 system card: GPT-5.2, December 2025. URL https://openai.com/index/gpt-5-system-card-update-gpt-5-2/

[60] [61]

OpenAI. GPT-5.5 system card, April 2026. URL https://openai.com/index/gpt-5-5-system-card/

[61] [62]

Google DeepMind. Gemini 3 Flash model card. Technical report, Google DeepMind, December. URL https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Flash-Model-Card.pdf

[63] [64]

Google DeepMind. Gemini 3 Pro model card. Technical report, Google DeepMind, November 2025. URL https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf

[64] [65]

DeepSeek-AI. DeepSeek-V4: Towards highly efficient million-token context intelligence, April 2026. URL https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro

[65] [66]

Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5

[66] [67]

Qwen Team. Qwen3.6-Plus: Towards real world agents, April 2026. URL https://qwen.ai/blog?id=qwen3.6

[67] [68]

Moonshot AI. Kimi K2.6: Scaling agent orchestration with multimodal integration, April 2026. URL https://www.kimi.com/blog/kimi-k2-6

[68] [69]

GLM-5-Team. GLM-5: From vibe coding to agentic engineering, 2026. URL https://arxiv.org/abs/2602.15763

[69] [70]

Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045, 2024

[70] [71]

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770, 2023.

A. Appendix Contents

• Limitations: Appendix B
• Details on Analyses in Section 3: App...

Memory path layout: shared_memory_path_for_run(run_dir) → <dataset_output>/memory/<agent>/<model>/memory.json.

Valid MEMORY upgrades:
• Subclass MemoryRetriever with keyword/topic overlap scoring.
• FailureMemory: store only was_correct: false records for lesson retrieval.
• EntityResolutionMemory: cache canonical entity→URL mappings encountered before.
• Redesign ...
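Two of the memory upgrades listed above, keyword-overlap scoring and failure-only retrieval, can be illustrated with a minimal Python sketch. This is not the harness's actual code: the record fields `question` and `lesson`, the helper names, and the Jaccard scoring rule are all assumptions made for illustration. Only the `was_correct` flag comes from the template text.

```python
import re


def tokenize(text):
    """Lowercase word tokens; a simple stand-in for keyword/topic extraction."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query, record_text):
    """Jaccard overlap between query tokens and record tokens (assumed metric)."""
    q, r = tokenize(query), tokenize(record_text)
    if not q or not r:
        return 0.0
    return len(q & r) / len(q | r)


def retrieve_failures(query, records, top_k=2):
    """Return up to top_k failed records, ranked by keyword overlap with the query."""
    # Keep only records explicitly marked as incorrect (the FailureMemory idea).
    failures = [rec for rec in records if rec.get("was_correct") is False]
    scored = [(overlap_score(query, rec["question"]), rec) for rec in failures]
    # Drop records that share no keywords with the query.
    scored = [(score, rec) for score, rec in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [rec for _, rec in scored[:top_k]]


# Hypothetical memory records; the field names other than was_correct are assumed.
records = [
    {"question": "capital city of Australia", "was_correct": False,
     "lesson": "check an official source before answering"},
    {"question": "capital city of Canada", "was_correct": True,
     "lesson": "none"},
    {"question": "largest lake in Africa", "was_correct": False,
     "lesson": "compare by surface area, not volume"},
]

hits = retrieve_failures("What is the capital of Australia?", records)
```

In this sketch the correct Canada record is filtered out before scoring, and the Africa record is dropped because it shares no keywords with the query, so only the failed Australia record is returned as a lesson.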

Read base: note current retriever type, _store_memory_result shape

Read trajectory: focus on cases where prior-task memory should have helped but didn’t

Edit memory layer. Minimal supporting edits only

Write meta: {"component": "memory", "coding_agent": "...", "base": "...", "hypothesis": "...", ...}

## REMINDERS

Exactly ONE upgraded harness. Primary axis MEMORY. No task-specific hints.

Figure 15. Optimizer prompt template for the Memory axis.

Optimizer Prompt Template (skills.md) (Prompt) ##...

Prompt assembly in _build_agent_config(work_dir, problem) with extra_template_vars

Memory/tools rendering inside the system prompt ({memory_context}, {tool_list}).

Valid PROMPT upgrades:
• Stricter final-answer format rule (short exact strings, no explanations).
• Decompose-question-first scaffold (multi-hop decomposition before search).
• Verify-before-commit rule (“re-check sources agree before Submitting”).
• “If conflicting sources, fetc...
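The rendering of `{memory_context}` and `{tool_list}` into the system prompt, together with a stricter final-answer rule passed through extra template variables, might look roughly like the Python sketch below. The template wording, the `render_system_prompt` helper, and the `answer_rule` variable are hypothetical. Only the placeholder names and the `SYSTEM_TEMPLATE` and `extra_template_vars` identifiers appear in the template text, and their real contents are not shown there.

```python
# Hypothetical system template; the wording is an assumption for illustration.
SYSTEM_TEMPLATE = (
    "You are a research agent.\n"
    "Available tools:\n{tool_list}\n"
    "Relevant memory:\n{memory_context}\n"
    "{answer_rule}"
)


def render_system_prompt(template, tools, memories, extra_template_vars=None):
    """Fill the tool and memory placeholders, then any extra template variables."""
    tool_list = "\n".join(f"- {name}" for name in tools)
    # Fall back to an explicit marker when no memory entries are available.
    memory_context = "\n".join(f"- {entry}" for entry in memories) or "(none)"
    variables = {"tool_list": tool_list, "memory_context": memory_context}
    variables.update(extra_template_vars or {})
    return template.format(**variables)


prompt = render_system_prompt(
    SYSTEM_TEMPLATE,
    tools=["search", "fetch"],
    memories=[],
    extra_template_vars={
        "answer_rule": "Final answer: a short exact string, no explanation."
    },
)
```

Keeping the answer-format rule in a separate variable, rather than hard-coding it in the template, means a prompt-axis upgrade can change that one string without touching how tools and memory are rendered.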

Read base: note current SYSTEM_TEMPLATE, INSTANCE_TEMPLATE

Read trajectory: focus on was_correct: false + answer-format mismatches / wrong entity picks

Edit prompt strings. Minimal supporting edits only