ARES-LSHADE: Autoresearch-Enhanced LSHADE with Memetic Polish for the GNBG Benchmark

Abdullah Naeem; Anav Katwal; Ayon Dey; Manish Bhatt; Md Tamjidul Hoque; Md Wasi Ul kabir

arxiv: 2605.13877 · v2 · pith:C7JY6EUZnew · submitted 2026-05-09 · 💻 cs.NE · cs.AI

Pith reviewed 2026-05-20 23:22 UTC · model grok-4.3

keywords: LLM-designed evolutionary algorithms · differential evolution · GNBG benchmark · memetic algorithms · blackbox optimization · autoresearch loop · LSHADE · evolutionary computation

The pith

An LLM-driven research loop produces ARES-LSHADE that wins 510 of 744 GNBG evaluations while respecting strict blackbox rules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates how an autonomous LLM research loop can iteratively improve a differential evolution algorithm for the GNBG benchmark. Starting from the prior LLM-LSHADE winner, the algorithm adds a scout-augmented mutation operator with adaptive CMA-ES integration, which the loop produced, and a multi-start L-BFGS-B polish phase. Under the official 31-run protocol with competition budgets, it records 510 wins out of 744 (24 functions × 31 runs), where a win means a gap below 1e-8, and reaches machine precision on 18 of 24 functions. The authors also document that limiting the loop to operator edits and fitness observations prevents trivial solutions, whereas access to compositional metadata lets the algorithm solve the whole suite but violates the blackbox requirement.

Core claim

Through approximately thirty LLM-driven design experiments, the authors construct ARES-LSHADE, a memetic LSHADE variant that combines a scout-augmented mutation operator with adaptive CMA-ES integration and a multi-start L-BFGS-B polish phase. On the official evaluation it records 510 wins out of 744 and reaches machine precision on 18 functions. The loop independently identified the six remaining functions as the hardest of the suite; their plateau signatures are consistent with GNBG's compositional structure.

What carries the argument

The autonomous LLM-driven research loop restricted to an operator-only edit surface and fitness-only observation space that generates and validates algorithmic modifications.
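
A fitness-only observation space can be pictured as a thin wrapper around the benchmark. The sketch below is a minimal illustration under stated assumptions: `make_fitness_only`, `ToyProblem`, and its `component_minimum` attribute are hypothetical names, not taken from the paper's code or the GNBG distribution. The wrapper hands the optimizer a callable that returns fitness values and counts evaluations, and nothing else.

```python
def make_fitness_only(problem, max_evals):
    """Wrap `problem` so callers get fitness values and an evaluation count only."""
    state = {"evals": 0}

    def evaluate(x):
        # Enforce the evaluation budget before touching the problem.
        if state["evals"] >= max_evals:
            raise RuntimeError("evaluation budget exhausted")
        state["evals"] += 1
        return float(problem.fitness(x))

    def evals_used():
        return state["evals"]

    return evaluate, evals_used


class ToyProblem:
    """Hypothetical stand-in for a benchmark instance that carries internal metadata."""

    def __init__(self):
        # Metadata the optimizer is not supposed to read.
        self.component_minimum = [1.0, -2.0]

    def fitness(self, x):
        return sum((xi - mi) ** 2 for xi, mi in zip(x, self.component_minimum))


evaluate, evals_used = make_fitness_only(ToyProblem(), max_evals=1000)
value = evaluate([0.0, 0.0])  # (0 - 1)^2 + (0 + 2)^2 = 5.0
```

A closure like this is a convention rather than a security boundary, since Python code can still introspect it. It only makes the intended interface explicit.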

If this is right

The restricted LLM loop converges to a performance plateau on this benchmark.
The produced algorithm wins the majority of per-function comparisons while preserving blackbox integrity.
Widening the observation space to include compositional metadata produces trivial solutions that solve all functions but break competition rules.
The loop can self-identify the hardest functions through observed performance signatures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Restricted observation spaces may serve as a general safeguard when using LLMs to design algorithms for other blackbox benchmarks.
Similar autoresearch loops could accelerate algorithm development on additional numerical optimization suites.
Competitions relying on LLM design may require explicit rules on allowable information sources to keep results meaningful.

Load-bearing premise

An LLM research loop limited to operator edits and fitness observations can discover competitive non-trivial algorithms on GNBG without access to the benchmark's compositional metadata.

What would settle it

An independent re-run of ARES-LSHADE on the GNBG suite, using the same 31-run-per-function protocol and evaluation budgets, that either fails to reach machine precision on 18 functions or records fewer than 500 wins.
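
As a rough illustration of how such a re-run could be scored, the sketch below counts wins from a table of final gaps. It assumes a hypothetical `errors` mapping from function id to a list of per-run gaps and reads a win as a run whose gap is below the 1e-8 threshold quoted in the abstract. The helper name and the toy data are illustrative and do not come from the paper.

```python
WIN_THRESHOLD = 1e-8  # gap below which a run counts as a win (threshold from the abstract)
N_FUNCTIONS = 24
N_RUNS = 31


def count_wins(errors, threshold=WIN_THRESHOLD):
    """Return (total_wins, per_function_wins) for a dict {func_id: [gap, ...]}."""
    per_function = {
        fid: sum(1 for gap in gaps if gap < threshold)
        for fid, gaps in errors.items()
    }
    return sum(per_function.values()), per_function


# Toy data for three hypothetical functions, 31 runs each.
errors = {
    1: [1e-15] * 31,               # solved in every run
    2: [1e-15] * 20 + [0.3] * 11,  # solved in 20 of 31 runs
    3: [0.5] * 31,                 # never solved
}
total, per_function = count_wins(errors)
max_possible = N_FUNCTIONS * N_RUNS  # 24 * 31 = 744 in the official protocol
```

With real logs, `total` would be compared against the reported 510 and `max_possible` against 744.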

Figures

Figures reproduced from arXiv: 2605.13877 by Abdullah Naeem, Anav Katwal, Ayon Dey, Manish Bhatt, Md Tamjidul Hoque, Md Wasi Ul kabir.

**Figure 4.** Edit surface and observation space. Of six algorithm components, only the mutation operator was editable by the loop (left). Component_MinimumPosition and aggregate cross-function statistics were withheld from the loop's observation space (right). Both surfaces are design variables of the loop, not properties of the LLM.

**Figure 5.** Win-count progression across approximately thirty autoresearch loop iterations (illustrative reconstruction; the report records start point, end plateau, and approximate iteration count, with per-experiment outcomes synthesized for visualization). Innovation steps that broke the previous best are labeled; the loop converges to a stable 16–17 win plateau. The mutation operator submitted to ARES-LSHADE is th…

Original abstract

We present ARES-LSHADE, a memetic differential-evolution variant submitted to the GECCO 2026 competition on LLM-designed evolutionary algorithms for the Generalized Numerical Benchmark Generator (GNBG). The algorithm builds on the LLM-LSHADE 2025 winner, contributing two new components: (a) a scout-augmented mutation operator with adaptive CMA-ES integration, produced by an autonomous research loop across approximately thirty LLM-driven design experiments, and (b) a multi-start L-BFGS-B polish phase that respects strict blackbox treatment of the benchmark. On the official 31-run-per-function evaluation with the competition-specified function-evaluation budgets, ARES-LSHADE obtains 510 of 744 wins (per-function gap below 1e-8), reaching machine precision on 18 of 24 functions. The remaining six functions exhibit characteristic plateau signatures consistent with GNBG's compositional structure, and were independently identified by the autoresearch loop as the hardest of the suite. Beyond the result itself, this report documents two methodological observations: (i) an LLM-driven research loop with operator-only edit surface and fitness-only observation space converges to a characteristic plateau on this benchmark; (ii) when we initially widened the observation space to include the benchmark's compositional metadata, the resulting algorithm trivially solved all 24 functions but violated the competition's blackbox rule, which we identified before submission. We discuss this tension between LLM capability and benchmark integrity as a design consideration for future LLM-driven optimization-algorithm research. Code and reproducibility artifacts are available at https://github.com/anaeem1/ARES-LSHADE.
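
The polish phase named in the abstract can be pictured with a short sketch. This is not the authors' implementation: the function name `multistart_polish`, the choice of start points, the per-start budget, and the tolerances are all assumptions made for illustration. It runs SciPy's L-BFGS-B from several starts, leaves gradients to finite differences so that only fitness values are queried, and counts every evaluation so the cost can be charged to the overall budget.

```python
import numpy as np
from scipy.optimize import minimize


def multistart_polish(f, starts, lower, upper, max_evals_per_start=2000):
    """Run L-BFGS-B from each start; return (best_x, best_f, evals_used).

    No gradient is supplied, so SciPy estimates it by finite differences
    and the objective is only ever queried for f(x).
    """
    evals = 0

    def counted(x):
        nonlocal evals
        evals += 1
        return float(f(x))

    bounds = [(lower, upper)] * len(starts[0])
    best_x, best_f = None, np.inf
    for x0 in starts:
        res = minimize(
            counted,
            np.asarray(x0, dtype=float),
            method="L-BFGS-B",
            bounds=bounds,
            # Tolerances and budget are illustrative assumptions.
            options={"maxfun": max_evals_per_start, "ftol": 1e-15, "gtol": 1e-12},
        )
        if res.fun < best_f:
            best_x, best_f = res.x.copy(), float(res.fun)
    return best_x, best_f, evals


# Toy usage on a shifted sphere, a hypothetical stand-in for a benchmark function.
rng = np.random.default_rng(0)
shift = np.full(5, 1.5)
sphere = lambda x: np.sum((x - shift) ** 2)
starts = rng.uniform(-100.0, 100.0, size=(3, 5))
best_x, best_f, evals_used = multistart_polish(sphere, starts, -100.0, 100.0)
```

In a memetic setting the starts would more plausibly be the best individuals found by the evolutionary phase than random points. The external counter is used for budget accounting because the solver's own `maxfun` limit is only approximate.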

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gets a competitive win count on GNBG with an LLM-tweaked LSHADE variant, but the blackbox integrity of the design loop still needs direct checks on prompts and logs.

The headline result is that ARES-LSHADE records 510 wins out of 744 on the official 31-run protocol and hits machine precision on 18 of the 24 GNBG functions. The authors ran an LLM loop for about thirty experiments to produce a scout-augmented mutation with adaptive CMA-ES, added a multi-start L-BFGS-B polish step, and kept the loop's observation space to fitness values only. They also document that widening the space to include compositional metadata immediately produced trivial solvers that broke the blackbox rule, which they caught before submission. That second observation is the most useful part of the work because it shows a concrete limit on how much structure an LLM can be allowed to see without collapsing the benchmark.

Referee Report

1 major / 2 minor

Summary. The manuscript presents ARES-LSHADE, a memetic differential-evolution variant of LSHADE developed via an LLM-driven autoresearch loop for the GECCO 2026 competition on LLM-designed algorithms for the GNBG benchmark. It introduces a scout-augmented mutation operator with adaptive CMA-ES integration, obtained from an operator-only edit surface observed solely through fitness values across approximately thirty LLM experiments, and a multi-start L-BFGS-B polish phase. On the official 31-run-per-function protocol with competition-specified budgets, the algorithm records 510 of 744 wins (gap below 1e-8) and reaches machine precision on 18 of 24 functions; the remaining six are identified as plateau functions consistent with GNBG structure. The paper additionally documents that expanding the observation space to include compositional metadata immediately produced trivial solvers that violate the blackbox rule.

Significance. If the blackbox compliance holds, the work provides concrete evidence that an LLM autoresearch loop restricted to operator edits and fitness observations can converge to a competitive optimizer on a challenging compositional benchmark. The explicit documentation of the integrity boundary (widening the observation space yields trivial solutions) and the public release of code plus reproducibility artifacts at https://github.com/anaeem1/ARES-LSHADE constitute clear strengths that aid verification and future research in LLM-assisted evolutionary algorithm design.

major comments (1)

[§3] §3 (autoresearch loop description): the central performance claim of 510 wins under strict blackbox rules depends on the absence of any indirect channel that could have allowed the LLM to reconstruct GNBG compositional structure from pre-training or prompt presentation. The manuscript states that metadata was avoided, yet supplies no explicit verification (e.g., prompt templates, context-window contents, or ablation logs) that would permit independent confirmation; this verification is load-bearing because the authors themselves observed that any widening of the observation space immediately produced trivial solutions.

minor comments (2)

[Abstract] Abstract: the phrase 'characteristic plateau signatures' is used without a concrete definition or reference to the metric (e.g., stagnation threshold or gradient norm) employed to identify the six hardest functions.
[Results] Results section: while aggregate win counts are given, the text lacks a supplementary table of per-function error statistics or standard deviations across the 31 runs, which would strengthen the cross-function superiority claim.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and for emphasizing the need for verifiable blackbox compliance in the autoresearch process. We address the major comment below and will strengthen the manuscript accordingly.

Referee: [§3] §3 (autoresearch loop description): the central performance claim of 510 wins under strict blackbox rules depends on the absence of any indirect channel that could have allowed the LLM to reconstruct GNBG compositional structure from pre-training or prompt presentation. The manuscript states that metadata was avoided, yet supplies no explicit verification (e.g., prompt templates, context-window contents, or ablation logs) that would permit independent confirmation; this verification is load-bearing because the authors themselves observed that any widening of the observation space immediately produced trivial solutions.

Authors: We agree that explicit verification materials are required for independent confirmation. In the revised manuscript we will expand §3 to include the complete prompt templates used across the thirty LLM experiments. These templates supplied the LLM exclusively with operator variant descriptions and aggregate fitness statistics (means and best values) obtained from blackbox evaluations; no GNBG compositional metadata or structural cues were present. We will also add representative context-window excerpts and expanded ablation logs documenting the immediate emergence of trivial solvers when metadata was introduced. These additions directly address the load-bearing concern and allow readers to confirm the restricted observation space used for the reported results. Regarding pre-training, the controlled contrast between metadata-inclusive and metadata-free runs provides the primary empirical safeguard.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; results rest on external blackbox benchmark evaluations

The paper reports performance from 31 independent runs per function on the GNBG benchmark using competition-specified evaluation budgets. The scout-augmented mutation operator was generated via an LLM loop restricted to operator edits and fitness observations only, and it is paired with an L-BFGS-B polish phase. The authors explicitly document rejecting wider observation spaces that produced trivial solutions violating blackbox rules. No equations, fitted parameters, or self-citations reduce the win counts or machine-precision claims to inputs by construction. The derivation chain is self-contained against the external benchmark.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard blackbox optimization assumptions and the empirical effectiveness of an LLM design loop; no new physical entities are postulated and free parameters are typical algorithm hyperparameters rather than ad-hoc inventions.

free parameters (1)

algorithm hyperparameters including population size, mutation factors, and CMA-ES adaptation settings
These are standard tunable elements in differential evolution variants and were likely explored or set during the thirty LLM-driven design experiments.

axioms (1)

domain assumption: GNBG benchmark functions must be treated strictly as blackbox, with no access to internal compositional metadata or structure.
Invoked explicitly as the competition rule and as the boundary the authors respected after observing that metadata access trivialized the problem.

pith-pipeline@v0.9.0 · 5856 in / 1488 out tokens · 58093 ms · 2026-05-20T23:22:05.629351+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: The mutation operator in ARES-LSHADE was produced by an LLM-driven autonomous research loop... edit surface was a single Python function, the mutation operator... observation space was: per-run fitness traces... λ, ω, and rotation

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: The competition rules say the benchmark must be treated as a blackbox... seeding L-BFGS-B at component minima... removed before submission

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages

[1] def mutate_2(self, x=None, y=None, a=None) returns (x_mu, f_mu, r)

[2] LPSR: n_individuals shrinks from ∼180 to 4. Handle n = 4 with empty archive. r1 from range(n_individuals), r2 from range(len(x_un))

[3] Boundaries SCALAR: lb = float(np.asarray(self.lower_boundary).flat[0])

[4] lambda_ shape is (CompNum, 1); use np.max(np.abs(np.asarray(self.lambda_)))

[5] F > 0 always. Use Cauchy cap: for _attempt in range(100):

[6] n ≥ 6 guard before any CMA state: USE_CMA = n >= 6. WHAT ACTUALLY HELPS (based on analysis): • f6/f15: EA gap∼0.1 is enough; L-BFGS-B takes it to 1e-15. Focus on REACHING basin. The plateau stagnation means the EA converges to wrong area. Need diversity. • f21: Optimum is near boundary. Add boundary-biased sampling when gap∼5.0. • f13: Multi-basin decept...
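
Entry [2] refers to linear population size reduction (LPSR) and to index selection when the population has shrunk to four individuals and the archive is empty. The sketch below shows one way those two pieces could look. It uses the standard L-SHADE linear schedule. The default of 180 comes from the "∼180" in the entry and is approximate, and the helper names `lpsr_size` and `pick_r1_r2` are hypothetical, not the paper's code.

```python
import random


def lpsr_size(nfe, max_nfe, n_init=180, n_min=4):
    """Linear population size reduction: n_init at nfe=0, n_min at nfe=max_nfe."""
    frac = min(max(nfe / max_nfe, 0.0), 1.0)
    return max(n_min, round(n_init + (n_min - n_init) * frac))


def pick_r1_r2(i, n_individuals, n_union, rng=random):
    """Pick r1 from the population and r2 from population + archive.

    n_union is the population size plus the archive size, so it equals
    n_individuals when the archive is empty. i, r1 and r2 are all distinct.
    """
    if n_individuals < 3 or n_union < n_individuals:
        raise ValueError("need at least 3 individuals and n_union >= n_individuals")
    r1 = rng.choice([k for k in range(n_individuals) if k != i])
    r2 = rng.choice([k for k in range(n_union) if k != i and k != r1])
    return r1, r2


# Edge case from the entry: four individuals, empty archive.
random.seed(0)
picks = [pick_r1_r2(i, n_individuals=4, n_union=4) for i in range(4)]
```

Filtering the candidate lists keeps the selection well defined at the minimum population size, where rejection sampling with a small pool is easy to get wrong.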

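
Entry [5] mentions keeping F positive with a capped number of Cauchy draws. A minimal sketch of that idea follows, assuming the usual SHADE-style rule: draw F from a Cauchy distribution centred on a memory value, resample non-positive draws, and truncate at 1. The scale of 0.1, the function name `sample_f`, and the fallback value are assumptions, not details confirmed by this page.

```python
import math
import random


def sample_f(mu_f, scale=0.1, max_attempts=100, rng=random):
    """Draw F from Cauchy(mu_f, scale), resampling non-positive draws and capping at 1."""
    for _attempt in range(max_attempts):
        # Inverse-CDF sampling of a Cauchy variate.
        f = mu_f + scale * math.tan(math.pi * (rng.random() - 0.5))
        if f > 0.0:
            return min(f, 1.0)
    # Fallback when every attempt was non-positive (an assumption, not from the source).
    return min(max(mu_f, 1e-3), 1.0)


random.seed(0)
samples = [sample_f(0.5) for _ in range(1000)]
degenerate = sample_f(-5.0)  # centre far below zero, so most draws are rejected
```

Bounding the retry loop avoids an unbounded `while` when the memory value drifts to or below zero, at the price of needing an explicit fallback.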