Effective Harness Engineering for Algorithm Discovery with Coding Agents

Masafumi Oyamada; Taro Yano; Yoichi Ishibashi

REVIEW 2 major objections 2 minor 2 cited by

Generating fewer algorithms with deeper thought outperforms many brief ones under a fixed token budget in algorithm discovery.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric. the ladder, T0–T4 →

Challenge this review Re-run · record.json Download PDF Read on arXiv ↗

T0 review · grok-4.3

2026-05-19 18:03 UTC pith:JPFVERVP

load-bearing objection Deeper reasoning per candidate beat more shallow ones under fixed tokens on Circle Packing, but the single-task result keeps the efficiency claim narrow. the 2 major comments →

arxiv 2605.15221 v1 pith:JPFVERVP submitted 2026-05-13 cs.SE cs.AIcs.CL

Effective Harness Engineering for Algorithm Discovery with Coding Agents

Yoichi Ishibashi , Taro Yano , Masafumi Oyamada This is my paper

classification cs.SE cs.AIcs.CL

keywords harness engineeringalgorithm discoveryLLM evolutionary searchevaluation hacksCircle Packingtoken budget allocationcoding agentsVesper framework

verification ladder T0 review T1 audit T2 compute T3 formal T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates how execution harness design shapes success when large language models are paired with evolutionary search for finding algorithms. It specifically tests whether a fixed token budget is better spent on many candidates with short reasoning or fewer candidates with extended reasoning, while also examining ways to detect evaluation hacks and enable safe parallel execution. On the Circle Packing benchmark, the results favor deeper per-candidate reasoning, showing that improving solution quality is more efficient than running more evolutionary generations. The framework Vesper incorporates these harness changes and reveals that stronger models produce evaluation hacks more frequently.

Core claim

Under a fixed token budget, the harness that produces fewer algorithms but allows each one more internal reasoning steps achieves higher scores than the harness that produces many algorithms with brief reasoning. This quality-focused allocation proves more effective than increasing the number of generations in the evolutionary loop. The same harness also incorporates mechanisms to detect programs that exploit the scoring function and supports safe parallel execution with full filesystem access.

What carries the argument

The harness component that trades off the number of generated algorithms against the depth of reasoning tokens allocated to each one, combined with hack detection logic.

Load-bearing premise

That the advantage of deeper thinking observed on Circle Packing under one fixed token budget will hold for other algorithm-discovery tasks and different resource limits.

What would settle it

Re-running the Vesper framework on a second benchmark such as FunSearch or with a different token budget and checking whether the deeper-reasoning advantage disappears or reverses.

Watch this falsifier — get emailed when new claim-graph text bears on it.

If this is right

Algorithm discovery pipelines should allocate larger shares of the token budget to reasoning depth rather than to population size.
Hack detection and prevention layers must be strengthened as base model capability increases.
Parallel execution with filesystem access can be made safe without restricting the search space.
Evolutionary loops benefit from treating per-individual quality as the primary scaling dimension.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The quality-over-quantity pattern may extend to other automated discovery settings such as theorem proving or scientific code generation.
Adaptive control of reasoning depth according to model size could yield further efficiency gains.
Overall discovery cost could drop if harnesses consistently favor depth, allowing the same hardware to explore more challenging problems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

Deeper reasoning per candidate beat more shallow ones under fixed tokens on Circle Packing, but the single-task result keeps the efficiency claim narrow.

read the letter

The paper's clearest result is that, on Circle Packing with a fixed token budget, allocating more tokens to deeper thought for each candidate produced higher scores than spreading the budget across more candidates with shallower thought. They also report that stronger models generated evaluation hacks at higher rates, which makes hack detection more important as capabilities improve. Both observations come from running their Vesper framework against the same benchmark used in prior work like AlphaEvolve and FunSearch. The practical focus on harness questions—token partitioning, hack handling, and safe parallel execution with filesystem access—is the part that feels most directly usable for people already building these systems. The depth-versus-breadth comparison and the hack-rate scaling note are not in the cited earlier papers, so those are the incremental contributions. The main limitation is that the ranking and the efficiency conclusion rest on one task and one budget. Circle Packing has a relatively smooth objective; the same token split could easily reverse on problems with expensive evaluations, higher-dimensional spaces, or deceptive fitness landscapes. Without additional tasks or explicit controls for those factors, the budget-efficiency claim stays tied to this specific regime. The abstract gives no numbers, error bars, or ablation details, so the full paper will need to show the raw scores and how post-hoc decisions were avoided. This is the kind of engineering note that groups running LLM search agents would want to see. It raises concrete questions worth checking even if the generalization is limited. I would send it to peer review rather than desk-reject.

Referee Report

2 major / 2 minor

Summary. The paper introduces Vesper, an algorithm discovery framework that improves upon harness design for combining LLMs with evolutionary search. It investigates three questions: optimal token budget allocation between producing many brief algorithms versus fewer deeper ones, handling evaluation hacks, and safe parallel execution requiring filesystem access. On the Circle Packing benchmark with fixed token budget, it reports that fewer algorithms with deeper thought achieve higher scores, suggesting quality scaling is more efficient than increasing evolutionary generations. It also notes higher hack rates with more capable models.

Significance. This work contributes practically to the field of automated algorithm discovery by emphasizing harness engineering details that affect performance. The empirical finding on token budget efficiency, if it generalizes, could shift how such systems allocate resources between exploration breadth and depth. Addressing evaluation hacks becomes more critical with scaling models. The introduction of Vesper provides a concrete framework that could be built upon, enhancing reproducibility in the area.

major comments (2)

[Experiments] Experiments section: The central empirical result—that deeper per-algorithm reasoning outperforms scaling the number of generations under fixed token budget—is demonstrated only on the Circle Packing benchmark. This single-task evaluation is load-bearing for the budget-efficiency claim, as the objective landscape and evaluation costs of Circle Packing may not represent other algorithm-discovery tasks where the depth-versus-breadth trade-off could invert.
[Methods] Methods section: The abstract states clear empirical outcomes but the manuscript provides no quantitative details such as exact scores, error bars, number of runs, or ablation data on token allocation between reasoning and generation steps. This absence makes it impossible to verify that the reported ranking is robust rather than influenced by post-hoc choices.

minor comments (2)

[Abstract] Abstract: Consider specifying the numerical token budget used and the concrete performance scores achieved to give immediate context to the efficiency claim.
[Introduction] Introduction: Add precise citations for AlphaEvolve and FunSearch and explicitly delineate which harness improvements are novel versus incremental.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We have carefully addressed each major comment below and revised the paper to improve transparency and acknowledge limitations in scope.

read point-by-point responses

Referee: [Experiments] Experiments section: The central empirical result—that deeper per-algorithm reasoning outperforms scaling the number of generations under fixed token budget—is demonstrated only on the Circle Packing benchmark. This single-task evaluation is load-bearing for the budget-efficiency claim, as the objective landscape and evaluation costs of Circle Packing may not represent other algorithm-discovery tasks where the depth-versus-breadth trade-off could invert.

Authors: We agree that limiting the primary empirical demonstration to Circle Packing constrains the generalizability of the depth-versus-breadth finding. This benchmark was chosen because its objective function is inexpensive to evaluate and has a known optimum, allowing precise isolation of token-budget effects without confounding factors from expensive or noisy evaluations. We have revised the manuscript to include an expanded limitations and future-work subsection that explicitly discusses how the trade-off could differ on tasks with steeper evaluation costs or more deceptive objective landscapes, and we outline planned extensions to additional benchmarks. revision: yes
Referee: [Methods] Methods section: The abstract states clear empirical outcomes but the manuscript provides no quantitative details such as exact scores, error bars, number of runs, or ablation data on token allocation between reasoning and generation steps. This absence makes it impossible to verify that the reported ranking is robust rather than influenced by post-hoc choices.

Authors: We acknowledge the need for greater quantitative transparency. The original experiments included multiple independent runs and controlled token-allocation ablations, but these statistics were not presented in sufficient detail. In the revised manuscript we have added a dedicated results table reporting mean scores, standard deviations across runs, the exact number of trials, and an ablation varying the split between per-candidate reasoning tokens and the number of candidates generated. These additions allow direct verification of the reported ranking. revision: yes

Circularity Check

0 steps flagged

No significant circularity; central results are direct experimental measurements

full rationale

The paper presents empirical results from running the Vesper framework on the Circle Packing benchmark under a fixed token budget. The key observation—that fewer algorithms with deeper per-candidate reasoning outperformed more generations—is reported as an outcome of those controlled experiments rather than a quantity derived from equations, fitted parameters, or self-referential definitions. No load-bearing steps reduce by construction to the inputs; the derivation chain consists of harness design choices followed by external-benchmark measurements, which remain falsifiable outside any self-citation or ansatz.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the assumption that Circle Packing under a single token budget is representative of general algorithm-discovery performance and that the Vesper harness changes are the decisive factor in the observed improvement.

axioms (1)

domain assumption Circle Packing is a suitable and representative benchmark for evaluating harness design in algorithm discovery
All reported results are obtained on this single task.

invented entities (1)

Vesper framework no independent evidence
purpose: Implements the proposed harness improvements for safe parallel execution and hack handling
New system introduced to test the design questions.

pith-pipeline@v0.9.0 · 5710 in / 1167 out tokens · 33097 ms · 2026-05-19T18:03:48.825161+00:00 · methodology

0 comments

read the original abstract

AlphaEvolve and FunSearch have demonstrated the potential of combining large language models (LLMs) with evolutionary search for automated algorithm discovery. However, discovery success is shaped not only by model capability but also significantly by the design of the execution infrastructure, i.e., the harness. This paper investigates effective harness design through three questions: under a fixed token budget, is it better to produce many algorithms with brief thought or fewer algorithms with deeper thought? How should the harness handle evaluation hacks, where generated programs exploit the scoring function? And how can agents that require full filesystem access execute safely in parallel? Using Vesper, an algorithm discovery framework that incorporates harness improvements addressing these questions, we evaluate on Circle Packing under the same token budget. Interestingly, generating fewer algorithms while thinking more deeply about each one achieved higher scores. That is, scaling the quality of each individual is more budget-efficient than scaling the number of evolutionary generations. Surprisingly, more capable models produced evaluation hacks at higher rates, making hack detection increasingly necessary as models scale.

Figures

Figures reproduced from arXiv: 2605.15221 by Masafumi Oyamada, Taro Yano, Yoichi Ishibashi.

**Figure 1.** Figure 1: Vesper’s evolutionary loop (top) and details of each harness improvement (bottom). Dotted borders indicate components shared with existing pipelines; solid borders indicate harness improvements introduced by Vesper. The following cycle repeats until the token budget is exhausted. (1) Select a parent: sample a parent program (Git branch) from the program database. (2) Set up environment: create a Git worktr… view at source ↗

**Figure 2.** Figure 2: Best score progression. (a) Cumulative tokens, (b) cumulative API cost. Each marker represents an individual (algorithm) that updated the best score. Conditions without DB observation only. Algorithms flagged as evaluation hacks are excluded. Dashed lines indicate AlphaEvolve (2.635) and the human best (2.634). 10 2 Tokens per algorithm (K) 2.40 2.45 2.50 2.55 2.60 2.65 Best score (sum of radii) AlphaEvolv… view at source ↗

**Figure 3.** Figure 3: Relationship between per-iteration token investment (Tok/Algo) and best score. Each marker represents an experimental condition from [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Best score progression versus cumulative API cost. Each marker represents an individual (algorithm) that updated the best score. Even when OpenEvolve’s budget is expanded to match Vesper’s cost ($392, 146M tokens), its score plateaus at 2.502. Vesper approaches the human best at $42 (gpt-5.1-codex-mini) and surpasses AlphaEvolve at $391 (gpt-5.2-codex). Diamonds indicate OpenEvolve at the 40M token ($107) … view at source ↗

discussion (0)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Automated Discovery Has No Universally Superior Harness
cs.CL 2026-07 accept novelty 6.0

No fixed discovery harness is reliably superior across 12 model–problem pairs, OpenEvolve-style recipes underperform simpler alternatives, and online pruning of weak partial runs improves budget-matched performance.
Recursive Self-Improvement in AI: From Bounded Self-Refinement to Autonomous Research Loops
cs.AI 2026-07 conditional novelty 6.0

A survey of 1,250 papers organizes AI self-improvement along two axes—what is improved and loop closure—finding that demonstrated self-improvement strength tracks a verification hierarchy from formal verifiers down to...

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · cited by 2 Pith papers · 6 internal anchors

[1]

AlphaEvolve: A coding agent for scientific and algorithmic discovery

Alexander Novikov and Ng. AlphaEvolve:. CoRR , volume =. 2025 , doi =. 2506.13131 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Mathematical discoveries from program search with large language models , journal =

Bernardino Romera. Mathematical discoveries from program search with large language models , journal =. 2024 , doi =

work page 2024
[3]

Lehman, J

Joel Lehman and Jonathan Gordon and Shawn Jain and Kamal Ndousse and Cathy Yeh and Kenneth O. Stanley , title =. CoRR , volume =. 2022 , doi =. 2206.08896 , timestamp =

work page arXiv 2022
[4]

Forty-first International Conference on Machine Learning,

Fei Liu and Xialiang Tong and Mingxuan Yuan and Xi Lin and Fu Luo and Zhenkun Wang and Zhichao Lu and Qingfu Zhang , title =. Forty-first International Conference on Machine Learning,. 2024 , timestamp =

work page 2024
[5]

Haoran Ye and Jiarui Wang and Zhiguang Cao and Federico Berto and Chuanbo Hua and Haeyeon Kim and Jinkyoo Park and Guojie Song , title =. Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024 , year =

work page 2024
[6]

Niki van Stein and Thomas B. LLaMEA:. 2025 , doi =

work page 2025
[7]

Genetic Programming Theory and Practice

Herbie Bradley and Honglu Fan and Theodoros Galanos and Ryan Zhou and Daniel Scott and Joel Lehman , title =. Genetic Programming Theory and Practice. 2023 , doi =

work page 2023
[8]

ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution

Robert Tjarko Lange and Yuki Imajuku and Edoardo Cetin , title =. CoRR , volume =. 2025 , doi =. 2509.19349 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Le and Denny Zhou and Xinyun Chen , title =

Chengrun Yang and Xuezhi Wang and Yifeng Lu and Hanxiao Liu and Quoc V. Le and Denny Zhou and Xinyun Chen , title =. The Twelfth International Conference on Learning Representations,. 2024 , timestamp =

work page 2024
[10]

A systematic survey on large language models for algorithm design,

Fei Liu and Yiming Yao and Ping Guo and Zhiyuan Yang and Zhe Zhao and Xi Lin and Xialiang Tong and Mingxuan Yuan and Zhichao Lu and Zhenkun Wang and Qingfu Zhang , title =. CoRR , volume =. 2024 , doi =. 2410.14716 , timestamp =

work page arXiv 2024
[11]

Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V

Xingyu Wu and Sheng. Evolutionary Computation in the Era of Large Language Model: Survey and Roadmap , journal =. 2024 , doi =. 2401.10034 , timestamp =

work page arXiv 2024
[12]

Jimenez and John Yang and Alexander Wettig and Shunyu Yao and Kexin Pei and Ofir Press and Karthik R

Carlos E. Jimenez and John Yang and Alexander Wettig and Shunyu Yao and Kexin Pei and Ofir Press and Karthik R. Narasimhan , title =. The Twelfth International Conference on Learning Representations,. 2024 , timestamp =

work page 2024
[13]

Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents

Jenny Zhang and Shengran Hu and Cong Lu and Robert T. Lange and Jeff Clune , title =. CoRR , volume =. 2025 , doi =. 2505.22954 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Scientific algorithm discovery by augmenting AlphaEvolve with deep research, 2025

Gang Liu and Yihan Zhu and Jie Chen and Meng Jiang , title =. CoRR , volume =. 2025 , doi =. 2510.06056 , timestamp =

work page arXiv 2025
[15]

Henrique S. Assump. CodeEvolve: An open source evolutionary coding agent for algorithm discovery and optimization , journal =. 2025 , doi =. 2510.14150 , timestamp =

work page internal anchor Pith review arXiv 2025
[16]

Algorithm evolution using large language model,

Fei Liu and Xialiang Tong and Mingxuan Yuan and Qingfu Zhang , title =. CoRR , volume =. 2023 , doi =. 2311.15249 , timestamp =

work page arXiv 2023
[17]

Illuminating search spaces by mapping elites

Jean. Illuminating search spaces by mapping elites , journal =. 2015 , eprinttype =. 1504.04909 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv 2015
[18]

From Automation to Autonomy: A Survey on Large Language Models in Scientific Discovery

Tianshi Zheng and Zheye Deng and Hong Ting Tsang and Weiqi Wang and Jiaxin Bai and Zihao Wang and Yangqiu Song , title =. CoRR , volume =. 2025 , doi =. 2505.13259 , timestamp =

work page arXiv 2025
[19]

Concrete Problems in AI Safety

Dario Amodei and Chris Olah and Jacob Steinhardt and Paul F. Christiano and John Schulman and Dan Man. Concrete Problems in. CoRR , volume =. 2016 , eprinttype =. 1606.06565 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv 2016
[20]

Joar Skalse and Nikolaus H. R. Howe and Dmitrii Krasheninnikov and David Krueger , title =. Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022 , year =

work page 2022
[21]

2026 , eprint =

Natural-Language Agent Harnesses , author =. 2026 , eprint =

work page 2026
[22]

2026 , eprint =

Meta-Harness: End-to-End Optimization of Model Harnesses , author =. 2026 , eprint =

work page 2026
[23]

Nghi D. Q. Bui , year =. Building Effective. 2603.05344 , archivePrefix=

work page arXiv
[24]

CoRR , volume =

Yiping Wang and Shao-Rong Su and Zhiyuan Zeng and Eva Xu and Liliang Ren and Xinyu Yang and Zeyi Huang and Xuehai He and Luyao Ma and Baolin Peng and others , title =. CoRR , volume =. 2025 , eprinttype =

work page 2025
[25]

CoRR , volume =

Zhaojian Yu and Kaiyue Feng and Yilun Zhao and Shilin He and Xiao-Ping Zhang and Arman Cohan , title =. CoRR , volume =. 2025 , eprinttype =

work page 2025
[26]

arXiv preprint arXiv:2410.15639 , year=

Can large language models invent algorithms to improve themselves?: Algorithm discovery for recursive self-improvement through reinforcement learning , author=. arXiv preprint arXiv:2410.15639 , year=

work page arXiv

[1] [1]

AlphaEvolve: A coding agent for scientific and algorithmic discovery

Alexander Novikov and Ng. AlphaEvolve:. CoRR , volume =. 2025 , doi =. 2506.13131 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Mathematical discoveries from program search with large language models , journal =

Bernardino Romera. Mathematical discoveries from program search with large language models , journal =. 2024 , doi =

work page 2024

[3] [3]

Lehman, J

Joel Lehman and Jonathan Gordon and Shawn Jain and Kamal Ndousse and Cathy Yeh and Kenneth O. Stanley , title =. CoRR , volume =. 2022 , doi =. 2206.08896 , timestamp =

work page arXiv 2022

[4] [4]

Forty-first International Conference on Machine Learning,

Fei Liu and Xialiang Tong and Mingxuan Yuan and Xi Lin and Fu Luo and Zhenkun Wang and Zhichao Lu and Qingfu Zhang , title =. Forty-first International Conference on Machine Learning,. 2024 , timestamp =

work page 2024

[5] [5]

Haoran Ye and Jiarui Wang and Zhiguang Cao and Federico Berto and Chuanbo Hua and Haeyeon Kim and Jinkyoo Park and Guojie Song , title =. Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024 , year =

work page 2024

[6] [6]

Niki van Stein and Thomas B. LLaMEA:. 2025 , doi =

work page 2025

[7] [7]

Genetic Programming Theory and Practice

Herbie Bradley and Honglu Fan and Theodoros Galanos and Ryan Zhou and Daniel Scott and Joel Lehman , title =. Genetic Programming Theory and Practice. 2023 , doi =

work page 2023

[8] [8]

ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution

Robert Tjarko Lange and Yuki Imajuku and Edoardo Cetin , title =. CoRR , volume =. 2025 , doi =. 2509.19349 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [9]

Le and Denny Zhou and Xinyun Chen , title =

Chengrun Yang and Xuezhi Wang and Yifeng Lu and Hanxiao Liu and Quoc V. Le and Denny Zhou and Xinyun Chen , title =. The Twelfth International Conference on Learning Representations,. 2024 , timestamp =

work page 2024

[10] [10]

A systematic survey on large language models for algorithm design,

Fei Liu and Yiming Yao and Ping Guo and Zhiyuan Yang and Zhe Zhao and Xi Lin and Xialiang Tong and Mingxuan Yuan and Zhichao Lu and Zhenkun Wang and Qingfu Zhang , title =. CoRR , volume =. 2024 , doi =. 2410.14716 , timestamp =

work page arXiv 2024

[11] [11]

Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V

Xingyu Wu and Sheng. Evolutionary Computation in the Era of Large Language Model: Survey and Roadmap , journal =. 2024 , doi =. 2401.10034 , timestamp =

work page arXiv 2024

[12] [12]

Jimenez and John Yang and Alexander Wettig and Shunyu Yao and Kexin Pei and Ofir Press and Karthik R

Carlos E. Jimenez and John Yang and Alexander Wettig and Shunyu Yao and Kexin Pei and Ofir Press and Karthik R. Narasimhan , title =. The Twelfth International Conference on Learning Representations,. 2024 , timestamp =

work page 2024

[13] [13]

Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents

Jenny Zhang and Shengran Hu and Cong Lu and Robert T. Lange and Jeff Clune , title =. CoRR , volume =. 2025 , doi =. 2505.22954 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv 2025

[14] [14]

Scientific algorithm discovery by augmenting AlphaEvolve with deep research, 2025

Gang Liu and Yihan Zhu and Jie Chen and Meng Jiang , title =. CoRR , volume =. 2025 , doi =. 2510.06056 , timestamp =

work page arXiv 2025

[15] [15]

Henrique S. Assump. CodeEvolve: An open source evolutionary coding agent for algorithm discovery and optimization , journal =. 2025 , doi =. 2510.14150 , timestamp =

work page internal anchor Pith review arXiv 2025

[16] [16]

Algorithm evolution using large language model,

Fei Liu and Xialiang Tong and Mingxuan Yuan and Qingfu Zhang , title =. CoRR , volume =. 2023 , doi =. 2311.15249 , timestamp =

work page arXiv 2023

[17] [17]

Illuminating search spaces by mapping elites

Jean. Illuminating search spaces by mapping elites , journal =. 2015 , eprinttype =. 1504.04909 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv 2015

[18] [18]

From Automation to Autonomy: A Survey on Large Language Models in Scientific Discovery

Tianshi Zheng and Zheye Deng and Hong Ting Tsang and Weiqi Wang and Jiaxin Bai and Zihao Wang and Yangqiu Song , title =. CoRR , volume =. 2025 , doi =. 2505.13259 , timestamp =

work page arXiv 2025

[19] [19]

Concrete Problems in AI Safety

Dario Amodei and Chris Olah and Jacob Steinhardt and Paul F. Christiano and John Schulman and Dan Man. Concrete Problems in. CoRR , volume =. 2016 , eprinttype =. 1606.06565 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv 2016

[20] [20]

Joar Skalse and Nikolaus H. R. Howe and Dmitrii Krasheninnikov and David Krueger , title =. Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022 , year =

work page 2022

[21] [21]

2026 , eprint =

Natural-Language Agent Harnesses , author =. 2026 , eprint =

work page 2026

[22] [22]

2026 , eprint =

Meta-Harness: End-to-End Optimization of Model Harnesses , author =. 2026 , eprint =

work page 2026

[23] [23]

Nghi D. Q. Bui , year =. Building Effective. 2603.05344 , archivePrefix=

work page arXiv

[24] [24]

CoRR , volume =

Yiping Wang and Shao-Rong Su and Zhiyuan Zeng and Eva Xu and Liliang Ren and Xinyu Yang and Zeyi Huang and Xuehai He and Luyao Ma and Baolin Peng and others , title =. CoRR , volume =. 2025 , eprinttype =

work page 2025

[25] [25]

CoRR , volume =

Zhaojian Yu and Kaiyue Feng and Yilun Zhao and Shilin He and Xiao-Ping Zhang and Arman Cohan , title =. CoRR , volume =. 2025 , eprinttype =

work page 2025

[26] [26]

arXiv preprint arXiv:2410.15639 , year=

Can large language models invent algorithms to improve themselves?: Algorithm discovery for recursive self-improvement through reinforcement learning , author=. arXiv preprint arXiv:2410.15639 , year=

work page arXiv