Towards Direct Evaluation of Harness Optimizers via Priority Ranking
Pith reviewed 2026-05-22 06:09 UTC · model grok-4.3
The pith
Priority ranking directly evaluates harness optimizers by component impact and predicts their real-world success.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that an optimizer's ability to rank the priority of harness components for improvement correlates with its ability to successfully optimize agents over multiple steps, making priority ranking a reliable and low-cost predictor of optimization ability, supported by the Shor collection of scenarios.
What carries the argument
Priority ranking, which quantifies an optimizer's step-level ability by requiring it to order harness components according to their potential effect on agent performance when updated.
If this is right
- Optimizers can be assessed at individual update steps without full rollouts.
- Distinguishes informed decision-making from trial-and-error in optimization.
- Provides a scalable way to benchmark optimizers across different domains using Shor scenarios.
- Correlates ranking accuracy with end-to-end optimization gains.
Where Pith is reading between the lines
- Future optimizer training could incorporate ranking tasks as an auxiliary objective to improve performance.
- The approach might extend to evaluating other types of iterative AI agents beyond harness optimization.
- Shor scenarios could serve as a standard benchmark for optimization-related AI tasks.
Load-bearing premise
The 182 human-verified scenarios in Shor capture a representative range of optimization challenges so that ranking skill on them indicates general optimization skill on new harnesses.
What would settle it
Running actual multi-step harness optimizations with high-ranking and low-ranking optimizers on new, unseen scenarios and finding no difference in agent improvement rates would falsify the correlation.
Figures
read the original abstract
Harness optimization enables automated agent creation by having an optimizer agent iteratively update the harness of target agents. Despite its success, current studies evaluate optimizers solely by observing target agents' performance gains. This indirect end-improvement evaluation neglects optimizers' actions at intermediate steps, which are often erroneous and hinder agent performance. Therefore, it is unclear whether harness optimization is driven by optimizers' informed update actions or simply trial-and-error. This necessitates direct evaluation of harness optimizers. However, evaluating harness optimizers directly is non-trivial and costly due to the lack of oracle harnesses. To address this, we present a simple, low-cost design to directly evaluate them, namely priority ranking. By asking harness optimizers to rank components (e.g., tools) in a given harness by their potential to improve/hinder agent performance when updated, our design quantifies optimizer ability at the step level without expensive rollouts or manual examination. More importantly, optimizers' ranking performance correlates with their ability to improve agents in actual multi-step harness optimization, establishing priority ranking as a reliable predictor of optimization ability. Priority ranking is enabled by Shor, a collection of 182 human-verified optimization scenarios spanning across domains, designs, and time stages. Codes and data can be found at https://github.com/k59118/Harness_Optimizer_Evaluation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes priority ranking as a direct, low-cost method to evaluate harness optimizers by requiring them to rank harness components (e.g., tools) according to their potential to improve or hinder target agent performance. This addresses limitations of indirect evaluation via end-performance gains, which overlook intermediate erroneous actions. The approach is instantiated on the Shor dataset of 182 human-verified optimization scenarios spanning domains, designs, and stages; the central empirical claim is that optimizers' ranking accuracy on these scenarios correlates with their success in multi-step harness optimization, positioning priority ranking as a reliable predictor.
Significance. If the reported correlation is robust and generalizes beyond the Shor distribution, the work supplies a practical direct-evaluation primitive that avoids expensive rollouts while quantifying step-level optimizer behavior. The public release of code and data at the cited GitHub repository is a clear strength for reproducibility and follow-on research in automated agent construction.
major comments (3)
- [Abstract] Abstract and evaluation section: the assertion that ranking performance 'correlates with their ability to improve agents in actual multi-step harness optimization' is load-bearing for the predictor claim, yet the abstract supplies no correlation coefficient, p-value, sample size, or control for scenario-specific confounds. Explicit statistical reporting and a description of how the correlation was computed are required.
- [Shor dataset description] Shor dataset and experimental setup: both ranking and multi-step optimization results are obtained on (subsets of) the same 182 human-verified scenarios. This design leaves open whether the observed correlation reflects genuine optimizer skill or patterns that human verifiers preferentially selected; a hold-out evaluation on unseen harnesses or domains is needed to substantiate the 'reliable predictor' conclusion.
- [Priority ranking design] Methods for priority ranking: the procedure for selecting which components to rank and the precise scoring rubric used to obtain ground-truth rankings from human verification must be detailed to confirm that the metric is independent of prior optimizer performance and not circular.
minor comments (3)
- [Dataset] Add a short table summarizing the distribution of domains, design types, and temporal stages across the 182 scenarios to demonstrate coverage.
- [Introduction] Clarify in the introduction whether 'harness' refers to a specific prompt-engineering construct or a more general agent scaffolding to aid readers outside the immediate sub-area.
- [Figures] Ensure any figures showing example rankings include axis labels, legend, and error bars where statistical variation is reported.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments on our manuscript. We address each major comment below and indicate the revisions made to strengthen the paper.
read point-by-point responses
-
Referee: [Abstract] Abstract and evaluation section: the assertion that ranking performance 'correlates with their ability to improve agents in actual multi-step harness optimization' is load-bearing for the predictor claim, yet the abstract supplies no correlation coefficient, p-value, sample size, or control for scenario-specific confounds. Explicit statistical reporting and a description of how the correlation was computed are required.
Authors: We agree that the abstract and evaluation section would benefit from explicit statistical details to support the central claim. In the revised manuscript we have added the correlation coefficient, associated p-value, sample size, and a description of the computation method (Pearson correlation between ranking accuracy and multi-step success rate). We have also clarified the controls applied, including stratification by domain and optimization stage to account for scenario-specific confounds. revision: yes
-
Referee: [Shor dataset description] Shor dataset and experimental setup: both ranking and multi-step optimization results are obtained on (subsets of) the same 182 human-verified scenarios. This design leaves open whether the observed correlation reflects genuine optimizer skill or patterns that human verifiers preferentially selected; a hold-out evaluation on unseen harnesses or domains is needed to substantiate the 'reliable predictor' conclusion.
Authors: We acknowledge the potential concern about using the same scenarios. However, the ground-truth rankings were produced by human verifiers who had no access to optimizer outputs or performance data, and the 182 scenarios were selected to cover diverse domains, designs, and stages. This construction reduces the chance that the correlation arises merely from verifier-selected patterns. We therefore maintain that the reported correlation provides evidence for priority ranking as a predictor, though we recognize the value of future hold-out studies on entirely new harnesses. revision: no
-
Referee: [Priority ranking design] Methods for priority ranking: the procedure for selecting which components to rank and the precise scoring rubric used to obtain ground-truth rankings from human verification must be detailed to confirm that the metric is independent of prior optimizer performance and not circular.
Authors: We agree that greater methodological detail is warranted. In the revised manuscript we have expanded the Methods section to specify that all harness components (tools, prompts, and other elements) are ranked, and that human verifiers assign integer impact scores from -2 to +2 reflecting expected performance change before deriving the ground-truth order. This scoring occurs independently of any optimizer and prior to optimizer evaluation, ensuring the metric is non-circular. revision: yes
Circularity Check
No significant circularity; derivation remains self-contained
full rationale
The paper defines priority ranking as an independent, low-cost direct evaluation protocol that asks optimizers to rank harness components by improvement potential, then separately measures correlation against multi-step optimization gains on the Shor collection of 182 human-verified scenarios. No equations, fitted parameters, or self-citations are shown that reduce the claimed correlation to a definitional identity or to the same fitted values used as input. Shor functions as an external benchmark rather than a self-referential loop, and the central claim retains independent empirical content even if both metrics are computed on subsets of the same scenarios. The derivation chain therefore does not collapse by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Human verification ensures the 182 scenarios accurately represent real harness optimization challenges across domains and stages.
invented entities (1)
-
Shor dataset
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Priority ranking: rank components (prompt, memory, workflow, tool) by potential to improve/hinder agent performance when updated; correlates with actual optimization ability (ρ=0.602).
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.leanLogicNat_induction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Shor dataset of 182 scenarios; 8× cheaper and 17× faster than end-SR observation.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
React: Synergizing reasoning and acting in language models
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe eleventh international conference on learning representations, 2022
work page 2022
-
[2]
Hyungjoo Chae, Taeyoon Kwon, Seungjun Moon, Yongho Song, Dongjin Kang, Kai Tzu-iunn Ong, Beong-woo Kwak, Seonghyeon Bae, Seung-won Hwang, and Jinyoung Yeo. Coffee-gym: An en- vironment for evaluating and improving natural language feedback on erroneous code. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference o...
work page 2024
-
[3]
doi: 10.18653/v1/2024.emnlp-main.1254
Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.1254. URL https://aclanthology.org/2024.emnlp-main.1254/
-
[4]
Li Hu, Guoqiang Chen, Xiuwei Shang, Shaoyin Cheng, Benlong Wu, Gangyang Li, Xu Zhu, Weiming Zhang, and Nenghai Yu. CompileAgent: Automated real-world repo-level compilation with tool- integrated LLM-based agent system. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Asso...
-
[5]
URLhttps://aclanthology.org/2025.acl-long.103/
work page 2025
-
[6]
Towards lifelong dialogue agents via timeline-based memory management
Kai Tzu-iunn Ong, Namyoung Kim, Minju Gwak, Hyungjoo Chae, Taeyoon Kwon, Yohan Jo, Seung- won Hwang, Dongha Lee, and Jinyoung Yeo. Towards lifelong dialogue agents via timeline-based memory management. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors,Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Comput...
work page 2025
-
[8]
Muxin Tian, Zhe Wang, Blair Yang, Zhenwei Tang, Kunlun Zhu, Honghua Dong, Hanchen Li, Xinni Xie, Guangjing Wang, and Jiaxuan You. Swe-bench mobile: Can large language model agents develop industry-level mobile applications?arXiv preprint arXiv:2602.09540, 2026
-
[9]
Taeyoon Kwon, Dongwook Choi, Hyojun Kim, Sunghwan Kim, Seungjun Moon, Beong-woo Kwak, Kuan-Hao Huang, and Jinyoung Yeo. Embodied agents meet personalization: Investigating challenges and solutions through the lens of memory utilization.arXiv preprint arXiv:2505.16348, 2025
-
[10]
Harness design for long-running application development, Mar 2026
Anthropic. Harness design for long-running application development, Mar 2026. URLhttps://www. anthropic.com/engineering/harness-design-long-running-apps
work page 2026
-
[11]
Web agents with world models: Learning and leveraging environment dynamics in web navigation
Hyungjoo Chae, Namyoung Kim, Kai Tzu-iunn Ong, Minju Gwak, Gwanwoo Song, Jihoon Kim, Sungh- wan Kim, Dongha Lee, and Jinyoung Yeo. Web agents with world models: Learning and leveraging environment dynamics in web navigation. InThe Thirteenth International Conference on Learning Representations, 2025
work page 2025
-
[12]
ReCreate: Reasoning and Creating Domain Agents Driven by Experience
Zhezheng Hao, Hong Wang, Jian Luo, Jianqing Zhang, Yuyan Zhou, Qiang Lin, Can Wang, Hande Dong, and Jiawei Chen. Recreate: Reasoning and creating domain agents driven by experience.arXiv preprint arXiv:2601.11100, 2026. 12 Direct Evaluation of Harness Optimizers via Priority Ranking
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[13]
Automated Design of Agentic Systems
Shengran Hu, Cong Lu, and Jeff Clune. Automated design of agentic systems.arXiv preprint arXiv:2408.08435, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[14]
AFlow: Automating Agentic Workflow Generation
Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, et al. Aflow: Automating agentic workflow generation.arXiv preprint arXiv:2410.10762, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[15]
GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning
Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, et al. Gepa: Reflective prompt evolution can outperform reinforcement learning.arXiv preprint arXiv:2507.19457, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[16]
Dynamic cheatsheet: Test-time learning with adaptive memory
Mirac Suzgun, Mert Yuksekgonul, Federico Bianchi, Dan Jurafsky, and James Zou. Dynamic cheatsheet: Test-time learning with adaptive memory. In Vera Demberg, Kentaro Inui, and Lluís Marquez, editors, Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume1: LongPapers),pages7080–7106,Rabat,Morocco...
-
[17]
Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models
Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, et al. Agentic context engineering: Evolving contexts for self-improving language models.arXiv preprint arXiv:2510.04618, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[18]
VeRO: An Evaluation Harness for Agents to Optimize Agents
Varun Ursekar, Apaar Shanker, Veronica Chatrath, Sam Denton, et al. Vero: An evaluation harness for agents to optimize agents.arXiv preprint arXiv:2602.22480, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[19]
Hyperagents.arXiv preprint arXiv:2603.19461, 2026
Jenny Zhang, Bingchen Zhao, Wannan Yang, Jakob Foerster, Jeff Clune, Minqi Jiang, Sam Devlin, and Tatiana Shavrina. Hyperagents.arXiv preprint arXiv:2603.19461, 2026
-
[20]
Meta-Harness: End-to-End Optimization of Model Harnesses
Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn. Meta- harness: End-to-end optimization of model harnesses.arXiv preprint arXiv:2603.28052, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[21]
Optimizing generative ai by backpropagating language model feedback.Nature, 639 (8055):609–616, 2025
Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Pan Lu, Zhi Huang, Carlos Guestrin, and James Zou. Optimizing generative ai by backpropagating language model feedback.Nature, 639 (8055):609–616, 2025
work page 2025
-
[22]
Large language models as optimizers
Chengrun Yang, XuezhiWang, Yifeng Lu, Hanxiao Liu, Quoc VLe, Denny Zhou, and XinyunChen. Large language models as optimizers. InThe Twelfth International Conference on Learning Representations, 2023
work page 2023
-
[23]
AlphaEvolve: A coding agent for scientific and algorithmic discovery
Alexander Novikov, Ngân V˜u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. Alphaevolve: A coding agent for scientific and algorithmic discovery.arXiv preprint arXiv:2506.13131, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[24]
ThetaEvolve: Test-time Learning on Open Problems
Yiping Wang, Shao-Rong Su, Zhiyuan Zeng, Eva Xu, Liliang Ren, Xinyu Yang, Zeyi Huang, Xuehai He, Luyao Ma, Baolin Peng, et al. Thetaevolve: Test-time learning on open problems.arXiv preprint arXiv:2511.23473, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [25]
-
[26]
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023
work page 2023
-
[27]
Expel: Llm agentsareexperientiallearners
Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. Expel: Llm agentsareexperientiallearners. InProceedingsoftheAAAIConferenceonArtificialIntelligence, volume38, pages 19632–19642, 2024
work page 2024
-
[28]
PRINCIPLES: Synthetic strategy memory for proactive dialogue agents
NamyoungKim, KaiTzu-iunnOng, YeonjunHwang, MinseokKang, IiseoJihn, GayoungKim, MinjuKim, and Jinyoung Yeo. PRINCIPLES: Synthetic strategy memory for proactive dialogue agents. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Findings of the Association for Computational Linguistics: EMNLP 2025, pages 21329–21368, ...
work page 2025
-
[29]
In: Christodoulopoulos, C., Chakraborty, T., Rose, C., Peng, V
Association for Computational Linguistics. ISBN 979-8-89176-335-7. doi: 10.18653/v1/2025. findings-emnlp.1164. URLhttps://aclanthology.org/2025.findings-emnlp.1164/
-
[30]
CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery
Ao Qu, Han Zheng, Zijian Zhou, Yihao Yan, Yihong Tang, Shao Yong Ong, Fenglu Hong, Kaichen Zhou, Chonghe Jiang, Minwei Kong, et al. Coral: Towards autonomous multi-agent evolution for open-ended discovery.arXiv preprint arXiv:2604.01658, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[31]
PhD thesis, Maastricht University, 2010
Guillaume Chaslot.Monte-carlo tree search. PhD thesis, Maastricht University, 2010
work page 2010
-
[32]
Yinjie Wang, Ling Yang, Guohao Li, Mengdi Wang, and Bryon Aragam. Scoreflow: Mastering llm agent workflows via score-based preference optimization.arXiv preprint arXiv:2502.04306, 2025
-
[33]
Robustflow: Towards robust agentic workflow generation.arXiv preprint arXiv:2509.21834, 2025
Shengxiang Xu, Jiayi Zhang, Shimin Di, Yuyu Luo, Liang Yao, Hanmo Liu, Jia Zhu, Fan Liu, and Min- Ling Zhang. Robustflow: Towards robust agentic workflow generation.arXiv preprint arXiv:2509.21834, 2025
-
[34]
Flowreasoner: Reinforcing query-level meta-agents.arXiv preprint arXiv:2504.15257, 2025
Hongcheng Gao, Yue Liu, Yufei He, Longxu Dou, Chao Du, Zhijie Deng, Bryan Hooi, Min Lin, and Tianyu Pang. Flowreasoner: Reinforcing query-level meta-agents.arXiv preprint arXiv:2504.15257, 2025
-
[35]
Mingze Kong, Zikun Qu, Zhongquan Zhou, Pengyu Liang, Xiang Li, Zhiwei Shang, Zhi Hong, Kaiyu Huang, Zhiyong Wang, and Zhongxiang Dai. Workflow-r1: Group sub-sequence policy optimization for multi-turn workflow construction.arXiv preprint arXiv:2602.01202, 2026
-
[36]
Ling Yue, Kushal Raj Bhandari, Ching-Yun Ko, Dhaval Patel, Shuxin Lin, Nianjun Zhou, Jianxi Gao, Pin-Yu Chen, and Shaowu Pan. From static templates to dynamic runtime graphs: A survey of workflow optimization for llm agents.arXiv preprint arXiv:2603.22386, 2026
-
[37]
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023
work page 2023
-
[38]
Claude code by anthropic, 2026
Anthropic. Claude code by anthropic, 2026. URL https://www.anthropic.com/product/ claude-code
work page 2026
-
[39]
Fangyu Lei, Jixuan Chen, Yuxiao Ye, Ruisheng Cao, Dongchan Shin, Hongjin Su, Zhaoqing Suo, Hongcheng Gao, Wenjing Hu, Pengcheng Yin, et al. Spider 2.0: Evaluating language models on real-world enterprise text-to-sql workflows.arXiv preprint arXiv:2411.07763, 2024. 14 Direct Evaluation of Harness Optimizers via Priority Ranking
-
[40]
Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan.τ2-bench: Evaluating conversational agents in a dual-control environment.arXiv preprint arXiv:2506.07982, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[41]
Gaia: a benchmark for general ai assistants
Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. InThe Twelfth International Conference on Learning Representations, 2023
work page 2023
-
[42]
Introducing swe-bench verified, 2024
OpenAI. Introducing swe-bench verified, 2024. URLhttps://openai.com/index/introducing- swe-bench-verified/
work page 2024
-
[43]
Appworld: A controllable world of apps and people for benchmarking interactive coding agents
Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. Appworld: A controllable world of apps and people for benchmarking interactive coding agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), p...
work page 2024
-
[44]
Gpqa: A graduate-level google-proof q&a benchmark
David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. InFirst Conference on Language Modeling, 2024
work page 2024
-
[45]
OpenAI. Introducing codex, 2025. URLhttps://openai.com/index/introducing-codex/
work page 2025
-
[46]
OpenAI. Gpt-5.3-codex, 2026. URL https://openai.com/index/introducing-gpt-5-3- codex/
work page 2026
-
[47]
Introducing claude sonnet 4.6, 2026
Anthropic. Introducing claude sonnet 4.6, 2026. URL https://www.anthropic.com/news/ claude-sonnet-4-6
work page 2026
- [48]
-
[49]
Google Deepmind. Gemini 3.1 pro, 2026. URLhttps://deepmind.google/models/gemini/ pro/
work page 2026
-
[50]
Kilian Lieret et al. mini-SWE-agent: The 100 line AI agent that solves GitHub issues.https:// github.com/SWE-agent/mini-swe-agent, 2025. GitHub repository
work page 2025
-
[51]
OpenHands: An Open Platform for AI Software Developers as Generalist Agents
Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. Openhands: An open platform for ai software developers as generalist agents.arXiv preprint arXiv:2407.16741, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[52]
AutoAgent: A Fully-Automated and Zero-Code Framework for LLM Agents, 2025
Jiabin Tang, Tianyu Fan, and Chao Huang. AutoAgent: A Fully-Automated and Zero-Code Framework for LLM Agents, 2025. URLhttps://arxiv.org/abs/2502.05957
-
[53]
Y our agent may misevolve: Emergent risks in self-evolving LLM agents
Shuai Shao, Qihan Ren, Chen Qian, Boyi Wei, Dadi Guo, Jingyi Yang, Xinhao Song, Linfeng Zhang, Weinan Zhang, Dongrui Liu, et al. Your agent may misevolve: Emergent risks in self-evolving llm agents.arXiv preprint arXiv:2509.26354, 2025
-
[54]
John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering.Advances in Neural Information Processing Systems, 37:50528–50652, 2024
work page 2024
-
[55]
Executable code actions elicit better llm agents
Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. Executable code actions elicit better llm agents. InForty-first International Conference on Machine Learning, 2024. 15 Direct Evaluation of Harness Optimizers via Priority Ranking
work page 2024
-
[56]
Anthropic. System card: Claude Haiku 4.5. Technical report, Anthropic, October 2025. URLhttps: //anthropic.com/claude-haiku-4-5-system-card
work page 2025
-
[57]
System card: Claude Sonnet 4.6
Anthropic. System card: Claude Sonnet 4.6. Technical report, Anthropic, February 2026. URL https://www-cdn.anthropic.com/bbd8ef16d70b7a1665f14f306ee88b53f686aa75.pdf
work page 2026
-
[58]
Introducing GPT-4.1 in the api, April 2025
OpenAI. Introducing GPT-4.1 in the api, April 2025. URLhttps://openai.com/index/gpt-4-1/
work page 2025
-
[59]
OpenAI GPT-5 system card, August 2025
OpenAI. OpenAI GPT-5 system card, August 2025. URLhttps://openai.com/index/gpt-5- system-card/
work page 2025
-
[60]
Update to GPT-5 system card: GPT-5.2, December 2025
OpenAI. Update to GPT-5 system card: GPT-5.2, December 2025. URLhttps://openai.com/ index/gpt-5-system-card-update-gpt-5-2/
work page 2025
-
[61]
GPT-5.5 system card, April 2026
OpenAI. GPT-5.5 system card, April 2026. URLhttps://openai.com/index/gpt-5-5-system- card/
work page 2026
-
[62]
Google DeepMind. Gemini 3 Flash model card. Technical report, Google DeepMind, December
-
[63]
URLhttps://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3- Flash-Model-Card.pdf
-
[64]
Google DeepMind. Gemini 3 Pro model card. Technical report, Google DeepMind, November 2025. URL https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro- Model-Card.pdf
work page 2025
-
[65]
DeepSeek-V4: Towards highly efficient million-token context intelligence, April 2026
DeepSeek-AI. DeepSeek-V4: Towards highly efficient million-token context intelligence, April 2026. URLhttps://huggingface.co/deepseek-ai/DeepSeek-V4-Pro
work page 2026
-
[66]
Qwen3.5: Accelerating productivity with native multimodal agents, February 2026
Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5
work page 2026
-
[67]
Qwen3.6-Plus: Towards real world agents, April 2026
Qwen Team. Qwen3.6-Plus: Towards real world agents, April 2026. URLhttps://qwen.ai/blog? id=qwen3.6
work page 2026
-
[68]
Kimi K2.6: Scaling agent orchestration with multimodal integration, April 2026
Moonshot AI. Kimi K2.6: Scaling agent orchestration with multimodal integration, April 2026. URL https://www.kimi.com/blog/kimi-k2-6
work page 2026
-
[69]
GLM-5: from Vibe Coding to Agentic Engineering
GLM-5-Team. Glm-5: from vibe coding to agentic engineering, 2026. URLhttps://arxiv.org/ abs/2602.15763
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[70]
$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains.arXiv preprint arXiv:2406.12045, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[71]
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?arXiv preprint arXiv:2310.06770, 2023. 16 Direct Evaluation of Harness Optimizers via Priority Ranking A. Appendix Contents •Limitations: Appendix B •Details on Analyses in Section 3: App...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[72]
Valid MEMORY upgrades: •SubclassMemoryRetrieverwith keyword/topic overlap scoring
Memory path layout:shared_memory_path_for_run(run_dir)→ <dataset_output>/memory/<agent>/<model>/memory.json. Valid MEMORY upgrades: •SubclassMemoryRetrieverwith keyword/topic overlap scoring. •FailureMemory— store onlywas_correct: falserecords for lesson retrieval. •EntityResolutionMemory— cache canonical entity→URL mappings encountered before. •Redesign ...
-
[73]
Read base — note current retriever type,_store_memory_resultshape
-
[74]
Read trajectory — focus on cases where prior-task memory should have helped but didn’t
- [75]
-
[76]
Write meta:{"component": "memory", "coding_agent": "...", "base": "...", "hypothesis": "...", ...} ## REMINDERS Exactly ONE upgraded harness. Primary axis MEMORY. No task-specific hints. Figure 15.Optimizer prompt template for the Memory axis. 35 Direct Evaluation of Harness Optimizers via Priority Ranking Optimizer Prompt Template (skills.md) (Prompt) ##...
-
[77]
Prompt assembly in_build_agent_config(work_dir, problem)withextra_template_vars
-
[78]
re-check sources agree before Submitting
Memory/tools rendering inside the system prompt ({memory_context},{tool_list}). Valid PROMPT upgrades: •Stricter final-answer format rule (short exact strings, no explanations). •Decompose-question-first scaffold (multi-hop decomposition before search). •Verify-before-commit rule (“re-check sources agree before Submitting”). •“If conflicting sources, fetc...
-
[79]
Read base — note currentSYSTEM_TEMPLATE,INSTANCE_TEMPLATE
-
[80]
Read trajectory — focus onwas_correct: false+ answer-format mismatches / wrong entity picks
- [81]
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.