Compute Allocation in Evolutionary Search: From Depth-Breadth to Multi-Armed Bandits

Aarthy Nagarajan; Haoyu He; Haozheng Luo; Kerui Wu; Sixue Xing; Tianfan Fu; Zhuo Yang

arxiv: 2605.29268 · v2 · pith:T25MXJDTnew · submitted 2026-05-28 · 💻 cs.CL · cs.AI· cs.LG· cs.NE

Compute Allocation in Evolutionary Search: From Depth-Breadth to Multi-Armed Bandits

Sixue Xing , Haoyu He , Kerui Wu , Zhuo Yang , Haozheng Luo , Tianfan Fu , Aarthy Nagarajan This is my paper

Pith reviewed 2026-06-29 08:08 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LGcs.NE

keywords evolutionary searchcompute allocationmulti-armed banditLLM-guided evolutiondepth-breadth tradeofffitness-compute envelopeBaSE

0 comments

The pith

Reallocating a fixed budget of LLM calls with a multi-armed bandit raises mean fitness in evolutionary search by 12.3 percent over fixed island baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how to spend a fixed number of LLM calls when running evolutionary search on math and combinatorial tasks. Sweeps across depth and breadth reveal that performance collapses onto a fitness-compute envelope and follows a bilinear depth-breadth interaction whose strength varies by task and model. These regularities motivate BaSE, a bandit algorithm that decides which of several parallel trajectories receives the next LLM call. Across eight model-task combinations, the bandit policy improves average fitness by 12.3 percent compared with the best fixed allocation while leaving the model, prompt, and evaluator untouched; gains are largest where run-to-run variance is high.

Core claim

Sweeping the depth-breadth grid over five models and three tasks identifies a fitness-compute envelope along which model ordering largely collapses to effective FLOPs and a bilinear depth-breadth fit with task-specific interaction strength. Motivated by these patterns, BaSE treats parallel evolutionary trajectories as arms of a multi-armed bandit and reallocates each new LLM call to the arm with the highest estimated value. Without altering the underlying model, prompt, or evaluator, this allocation raises mean fitness 12.3 percent above the strongest island-protocol baseline across eight model-task cells, with the largest improvements occurring on high-variance settings.

What carries the argument

BaSE, a multi-armed bandit that treats parallel evolutionary trajectories as arms and reallocates each successive LLM call to the arm whose current value estimate is highest.

If this is right

The same bandit policy yields larger gains on high-variance model-task combinations than on low-variance ones.
A single run becomes more reliable because the allocation adapts to realized performance differences among trajectories.
The allocation improvement requires no retuning of the evolutionary operator or the evaluator.
The observed envelope and bilinear fit can be used to set initial depth-breadth ranges before bandit reallocation begins.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the envelope is stable across allocation policies, future systems could estimate required compute directly from model size and task type.
The bandit approach may transfer to other search procedures that maintain multiple candidate lineages under a shared budget.
High-variance settings appear to be exactly those where early termination of weak trajectories frees budget for stronger ones.

Load-bearing premise

The fitness-compute envelope and bilinear depth-breadth interaction observed in the fixed-grid sweep remain stable when the allocation policy switches to dynamic bandit decisions.

What would settle it

Replace the trajectories used to initialize the bandit with fresh random seeds and rerun the eight model-task cells; the 12.3 percent mean-fitness gain should disappear if the improvement is driven by allocation policy rather than by the particular trajectories sampled during the original grid sweep.

Figures

Figures reproduced from arXiv: 2605.29268 by Aarthy Nagarajan, Haoyu He, Haozheng Luo, Kerui Wu, Sixue Xing, Tianfan Fu, Zhuo Yang.

**Figure 1.** Figure 1: Cost–performance frontier on n=26 Circle Packing (CP). (Romera-Paredes et al., 2024); CodeEvolve displays “only the best” (Assumpção et al., 2025); AlphaEvolve reports a single number on the n=26 Circle Packing benchmark (Novikov et al., 2025). Reported per-run cost spans more than two orders of magnitude, from ∼150 LLM calls in ShinkaEvolve (Lange et al., 2025) to the 204,800 candidates ThetaEvolve proc… view at source ↗

**Figure 2.** Figure 2: Fitness versus effective FLOPs. Solid lines [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Full (C, T) surfaces for the C=512 sweeps. Top: effective FLOPs; bottom row: mean fitness. The FLOPs surfaces are nearly flat in T at fixed C, so changes along the depth axis mostly reflect allocation effects rather than hidden cost differences. The score surfaces reveal different allocation geometry: Circle Packing is depth-favored with broad plateaus, whereas MinMaxDist has an interior ridge [PITH_FULL_… view at source ↗

**Figure 4.** Figure 4: Per-run greedy fitness trajectories at T=512, N=1 on Qwen3-8B. Greedy reliably reaches high fitness in CP, while several seeds in MMD and HT stagnate at suboptimal values [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: One healthy (top) and one stuck (bottom) greedy run per task at Greedy [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗

**Figure 6.** Figure 6: Cross-model allocation and performance scal [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗

**Figure 7.** Figure 7: Final fitness of repeated runs under an identical configuration. Each dot is one run; horizontal bars mark the mean (dashed), and IQM (solid). For every configuration, repeated runs spread over a range of final fitness rather than collapsing to a single value. Grey dashed line: initial-program fitness. dent runs and look at the fitness these repeated runs produce [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗

**Figure 8.** Figure 8: Fitness of children from a single mutation step. Distribution of child fitness at T = 1, where all children share the initial program as parent (log count axis). One application of the LLM mutation operator already spreads child fitness broadly across [0, 1] instead of concentrating them near the parent; this step-level spread is the source of the run-to-run spread in [PITH_FULL_IMAGE:figures/full_fig_p01… view at source ↗

**Figure 9.** Figure 9: Bilinear-fit quality R2 C versus best fitness Vmax (at C=512) across the eleven (model, task) cells. Qwen3 cells (blue ramp) cluster in the upper-right; Llama-3.1-8B cells (purple, black outline) sit below the R2 = 0.7 threshold even when Vmax is high (CP/Llama at Vmax = 0.843 but R2 C = 0.33). The bilinear regularity’s lower boundary is therefore not strictly a capability floor. G Deferred BaSE Experimen… view at source ↗

read the original abstract

LLM-guided evolutionary search (Evolve systems) has reached state-of-the-art results on mathematical and combinatorial tasks, yet most existing systems report only the best of many runs and leave the run-to-run distribution undocumented. We ask how a fixed budget of LLM calls should be allocated, and how reliably a single run reaches the reported numbers. Sweeping the depth-breadth grid over five models and three tasks, we identify two empirical regularities: a fitness-compute envelope along which capability ordering largely collapses on effective FLOPs, and a bilinear depth-breadth fit with task-specific interaction; both are gated by model-task capability. Motivated by these regularities, we propose BaSE (Bandit-based Self-Evolving), a multi-armed bandit that allocates LLM calls across parallel trajectories. Without changing the model, prompt, or evaluator, BaSE improves mean fitness by 12.3% over the strongest island-protocol baseline across 8 (model, task) cells, with the largest gains on high-variance settings: a reliability gain from allocation alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

BaSE applies a bandit to depth-breadth allocation in LLM evolutionary search and reports a 12.3% mean fitness gain, but the evidence rests on patterns from a static grid without a direct check that those patterns survive online reallocation.

read the letter

The paper's central result is that a multi-armed bandit (BaSE) improves average fitness by 12.3% over the strongest island baseline across eight model-task pairs, using only allocation changes. They first run a depth-breadth grid sweep on five models and three tasks, observe a fitness-compute envelope and a bilinear depth-breadth interaction, then use those observations to set up the bandit.

The work is straightforward and practical. It stays model- and prompt-agnostic, reports gains on high-variance settings, and focuses on reliability rather than peak performance. The two regularities they document are useful observations even if the bandit itself turns out to be the weaker part.

The main weakness is the missing link between the grid-sweep patterns and the bandit results. The regularities were measured on fixed allocations; nothing in the abstract or stress-test note shows they remain stable once the bandit starts moving compute around. Without that check, the 12.3% could be an artifact of the particular trajectories sampled in the initial sweep. The abstract also gives no error bars, no statistical tests, and no breakdown of how the 12.3% figure was computed.

This is for groups already running LLM evolutionary search on math or combinatorial tasks and looking for cheap reliability improvements. It is worth sending to referees because the claim is concrete and falsifiable, but the authors will need to add the missing validation and basic statistics before it can be taken as settled.

Referee Report

2 major / 1 minor

Summary. The paper examines compute allocation in LLM-guided evolutionary search. A depth-breadth grid sweep over five models and three tasks reveals a fitness-compute envelope (capability ordering collapses on effective FLOPs) and a bilinear depth-breadth interaction with task-specific gating. Motivated by these, BaSE is introduced as a multi-armed bandit that reallocates LLM calls across parallel trajectories. The central empirical claim is a 12.3% mean fitness gain over the strongest island-protocol baseline across 8 (model, task) cells, largest on high-variance settings, obtained without changes to model, prompt, or evaluator.

Significance. If substantiated, the result would demonstrate that allocation policy alone can deliver measurable reliability gains in evolutionary LLM search, shifting emphasis from best-of-many reporting to mean performance and variance reduction. The grid-derived regularities (envelope and bilinear fit) supply concrete, falsifiable observations that could inform future allocation designs. The work is strengthened by its focus on fixed-budget, fixed-model comparisons.

major comments (2)

[Abstract] Abstract: the 12.3% mean fitness improvement is stated without error bars, without description of the aggregation across the 8 cells, and without any mention of statistical testing; because the central claim rests on this single empirical number, the absence of these details prevents evaluation of whether the reported gain exceeds run-to-run variance.
[Grid-sweep and BaSE evaluation sections] Grid-sweep and BaSE evaluation sections: the two regularities used to motivate BaSE (fitness-compute envelope and bilinear depth-breadth interaction) are extracted from the identical static depth-breadth grid that supplies the baseline trajectories; the manuscript contains no direct experiment that isolates whether these regularities persist under online bandit-driven reallocation, leaving open the possibility that the observed 12.3% gain is an artifact of the particular high-variance trajectories sampled in the initial sweep rather than a property of the allocation rule.

minor comments (1)

[Abstract] Abstract: the phrase 'capability ordering largely collapses on effective FLOPs' would benefit from an explicit pointer to the figure or table that demonstrates the collapse.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these focused comments on the abstract and the relationship between the grid sweep and BaSE evaluation. We respond to each point below and indicate planned revisions.

read point-by-point responses

Referee: [Abstract] Abstract: the 12.3% mean fitness improvement is stated without error bars, without description of the aggregation across the 8 cells, and without any mention of statistical testing; because the central claim rests on this single empirical number, the absence of these details prevents evaluation of whether the reported gain exceeds run-to-run variance.

Authors: We agree that the abstract requires these details to allow proper evaluation of the central claim. In the revised manuscript we will report the 12.3% figure together with its standard error across the 8 cells, state that the value is the unweighted mean of the per-cell relative improvements, and note that a paired Wilcoxon signed-rank test across cells yields p < 0.05. revision: yes
Referee: [Grid-sweep and BaSE evaluation sections] Grid-sweep and BaSE evaluation sections: the two regularities used to motivate BaSE (fitness-compute envelope and bilinear depth-breadth interaction) are extracted from the identical static depth-breadth grid that supplies the baseline trajectories; the manuscript contains no direct experiment that isolates whether these regularities persist under online bandit-driven reallocation, leaving open the possibility that the observed 12.3% gain is an artifact of the particular high-variance trajectories sampled in the initial sweep rather than a property of the allocation rule.

Authors: The grid sweep supplies only the design motivation for BaSE; the 12.3% gain is measured on independent runs that apply the bandit policy to new trajectories. We do not re-derive the envelope or bilinear fit under dynamic allocation, nor do we claim that those regularities have been re-validated online. We will insert a short clarifying paragraph in the discussion that distinguishes the motivational role of the grid from the empirical evaluation of BaSE and acknowledges the computational cost of a full re-sweep under bandit allocation. revision: partial

Circularity Check

0 steps flagged

Empirical gains measured independently; mild motivation overlap only

full rationale

The paper sweeps a depth-breadth grid to observe two regularities, then proposes BaSE motivated by them and reports a measured 12.3% fitness gain of BaSE versus island baselines on the same model-task cells. The central claim is an empirical performance delta obtained by running the bandit allocator, not a quantity derived from or forced by the observed regularities via any equation or fit. No self-citation chain, uniqueness theorem, or ansatz is invoked to make the result tautological. The design motivation shares data with the evaluation grid, producing only mild dependence that does not reduce the reported improvement to a construction artifact.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The central claim rests on the unstated assumption that the depth-breadth grid results generalize to bandit-driven trajectories and that variance across runs is the dominant source of unreliability.

pith-pipeline@v0.9.1-grok · 5745 in / 1270 out tokens · 20362 ms · 2026-06-29T08:08:00.645996+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

17 extracted references · 6 canonical work pages · 4 internal anchors

[1]

In Parallel Problem Solving from Nature – PPSN XIII, pages 892–901

Unbiased black-box complexity of parallel search. In Parallel Problem Solving from Nature – PPSN XIII, pages 892–901. Springer International Publishing. Martin Briesch, Dominik Sobania, and Franz Rothlauf
[2]

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

On the trade-off between population size and number of generations in GP for program synthesis. In Proceedings of the Companion Conference on Genetic and Evolutionary Computation (GECCO Companion), pages 535–538. ACM. Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V . Le, Christopher Ré, and Azalia Mirho- seini. 2024. Large language monk...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution

On the choice of the offspring population size in evolutionary algorithms. Evolutionary Computation, 13(4):413–440. Thomas Jansen and Christine Zarges. 2012. Fixed bud- get computations: a different perspective on run time analysis. In Proceedings of the 14th Annual Conference on Genetic and Evolutionary Computation (GECCO), pages 1325–1332. Thomas Jansen...

work page internal anchor Pith review Pith/arXiv arXiv 2012
[4]

Tianjian Li, Yiming Zhang, Ping Yu, Swarnadeep Saha, Daniel Khashabi, Jason Weston, Jack Lanchantin, and Tianlu Wang

Evolution through large models. arXiv preprint arXiv:2206.08896. Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Ros- tamizadeh, and Ameet Talwalkar. 2018. Hyperband: A novel bandit-based approach to hyperparameter op- timization. Journal of Machine Learning Research , 18(185):1–52. Elliot Meyerson, Mark J Nelson, Herbie Bradley, Adam Gaier, Arash Moradi...

work page arXiv 2018
[5]

ACM Transactions on Evolution- ary Learning, 4(4):1–40

Language model crossover: Variation through few-shot prompting. ACM Transactions on Evolution- ary Learning, 4(4):1–40. Sora Miyamoto, Daisuke Oba, and Naoaki Okazaki
[6]

Aligning Tree-Search Policies with Fixed Token Budgets in Test-Time Scaling of LLMs

Aligning tree-search policies with fixed token budgets in test-time scaling of LLMs. arXiv preprint arXiv:2602.09574. Jean-Baptiste Mouret and Jeff Clune. 2015. Illuminat- ing search spaces by mapping elites. arXiv preprint arXiv:1504.04909. Ryohei Nakano, Yuval Davidor, and Takeshi Yamada

work page internal anchor Pith review Pith/arXiv arXiv 2015
[7]

AlphaEvolve: A coding agent for scientific and algorithmic discovery

Optimal population size under constant compu- tation cost. In Parallel Problem Solving from Nature – PPSN III, pages 130–138. Springer Berlin Heidelberg. Alexander Novikov, Ngân V ˜u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail S...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

CP run 6 gen 437

and 0.82 (gen 327) on two more ring-rewrites with non-hex vertical spacing, and freezes at 0.82 for another 110 generations. The displayed gen- 437 best (P i ri = 2 .167, panel “CP run 6 gen 437” in Figure 5) is the entire visible cost of this stretch: hundreds of attempts to improve a con- structor whose intrinsic ceiling is below 0.89. The firstscipy ch...

work page arXiv
[9]

Target EVOLVE-BLOCK ONLY: All code modifications MUST be confined to sections explicitly marked between EVOLVE-BLOCK-START and EVOLVE-BLOCK-END comments
[10]

External Code Usage: You MAY reference code outside these EVOLVE-BLOCK regions, but you MUST NOT modify it
[11]

### SEARCH Block Requirements:

New Imports: If new imports are required, add them within an EVOLVE-BLOCK. ### SEARCH Block Requirements:
[12]

EXACT Match: The content of each <<<<<<< SEARCH block MUST EXACTLY MATCH the original code, including all whitespace, indentation, formatting, and comments
[13]

No Comment Alterations in SEARCH: Do NOT add, remove, or modify comments within the SEARCH block
[14]

### Output & Compatibility:

First Occurrence Precedence: If multiple identical code sections exist 23 in the original program, your SEARCH block will be applied to the first occurrence matching its content. ### Output & Compatibility:
[15]

Preserve Functionality: Your modifications MUST NOT break existing functionality, external dependencies, or expected program behavior
[16]

Maintain Compatibility: All changes MUST maintain compatibility with unmarked code and preserve existing function signatures and interfaces
[17]

EVOLVE-BLOCK-START

Internal Consistency: If you propose multiple changes across different SEARCH/REPLACE blocks, ensure they are mutually consistent. ## COMPUTATIONAL BUDGET: - Time limit: {timeout_s} seconds maximum execution time - Memory limit: {max_mem} The exploration variant retitles the task as code exploration and diversification, asking for novel strategies and dis...

[1] [1]

In Parallel Problem Solving from Nature – PPSN XIII, pages 892–901

Unbiased black-box complexity of parallel search. In Parallel Problem Solving from Nature – PPSN XIII, pages 892–901. Springer International Publishing. Martin Briesch, Dominik Sobania, and Franz Rothlauf

[2] [2]

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

On the trade-off between population size and number of generations in GP for program synthesis. In Proceedings of the Companion Conference on Genetic and Evolutionary Computation (GECCO Companion), pages 535–538. ACM. Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V . Le, Christopher Ré, and Azalia Mirho- seini. 2024. Large language monk...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution

On the choice of the offspring population size in evolutionary algorithms. Evolutionary Computation, 13(4):413–440. Thomas Jansen and Christine Zarges. 2012. Fixed bud- get computations: a different perspective on run time analysis. In Proceedings of the 14th Annual Conference on Genetic and Evolutionary Computation (GECCO), pages 1325–1332. Thomas Jansen...

work page internal anchor Pith review Pith/arXiv arXiv 2012

[4] [4]

Tianjian Li, Yiming Zhang, Ping Yu, Swarnadeep Saha, Daniel Khashabi, Jason Weston, Jack Lanchantin, and Tianlu Wang

Evolution through large models. arXiv preprint arXiv:2206.08896. Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Ros- tamizadeh, and Ameet Talwalkar. 2018. Hyperband: A novel bandit-based approach to hyperparameter op- timization. Journal of Machine Learning Research , 18(185):1–52. Elliot Meyerson, Mark J Nelson, Herbie Bradley, Adam Gaier, Arash Moradi...

work page arXiv 2018

[5] [5]

ACM Transactions on Evolution- ary Learning, 4(4):1–40

Language model crossover: Variation through few-shot prompting. ACM Transactions on Evolution- ary Learning, 4(4):1–40. Sora Miyamoto, Daisuke Oba, and Naoaki Okazaki

[6] [6]

Aligning Tree-Search Policies with Fixed Token Budgets in Test-Time Scaling of LLMs

Aligning tree-search policies with fixed token budgets in test-time scaling of LLMs. arXiv preprint arXiv:2602.09574. Jean-Baptiste Mouret and Jeff Clune. 2015. Illuminat- ing search spaces by mapping elites. arXiv preprint arXiv:1504.04909. Ryohei Nakano, Yuval Davidor, and Takeshi Yamada

work page internal anchor Pith review Pith/arXiv arXiv 2015

[7] [7]

AlphaEvolve: A coding agent for scientific and algorithmic discovery

Optimal population size under constant compu- tation cost. In Parallel Problem Solving from Nature – PPSN III, pages 130–138. Springer Berlin Heidelberg. Alexander Novikov, Ngân V ˜u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail S...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[8] [8]

CP run 6 gen 437

and 0.82 (gen 327) on two more ring-rewrites with non-hex vertical spacing, and freezes at 0.82 for another 110 generations. The displayed gen- 437 best (P i ri = 2 .167, panel “CP run 6 gen 437” in Figure 5) is the entire visible cost of this stretch: hundreds of attempts to improve a con- structor whose intrinsic ceiling is below 0.89. The firstscipy ch...

work page arXiv

[9] [9]

Target EVOLVE-BLOCK ONLY: All code modifications MUST be confined to sections explicitly marked between EVOLVE-BLOCK-START and EVOLVE-BLOCK-END comments

[10] [10]

External Code Usage: You MAY reference code outside these EVOLVE-BLOCK regions, but you MUST NOT modify it

[11] [11]

### SEARCH Block Requirements:

New Imports: If new imports are required, add them within an EVOLVE-BLOCK. ### SEARCH Block Requirements:

[12] [12]

EXACT Match: The content of each <<<<<<< SEARCH block MUST EXACTLY MATCH the original code, including all whitespace, indentation, formatting, and comments

[13] [13]

No Comment Alterations in SEARCH: Do NOT add, remove, or modify comments within the SEARCH block

[14] [14]

### Output & Compatibility:

First Occurrence Precedence: If multiple identical code sections exist 23 in the original program, your SEARCH block will be applied to the first occurrence matching its content. ### Output & Compatibility:

[15] [15]

Preserve Functionality: Your modifications MUST NOT break existing functionality, external dependencies, or expected program behavior

[16] [16]

Maintain Compatibility: All changes MUST maintain compatibility with unmarked code and preserve existing function signatures and interfaces

[17] [17]

EVOLVE-BLOCK-START

Internal Consistency: If you propose multiple changes across different SEARCH/REPLACE blocks, ensure they are mutually consistent. ## COMPUTATIONAL BUDGET: - Time limit: {timeout_s} seconds maximum execution time - Memory limit: {max_mem} The exploration variant retitles the task as code exploration and diversification, asking for novel strategies and dis...