arxiv: 2604.19761 · v1 · submitted 2026-03-26 · 💻 cs.AI · cs.LG · cs.NE

EvoForest: A Novel Machine-Learning Paradigm via Open-Ended Evolution of Computational Graphs

Kamer Ali Yuksel, Hassan Sawaf

Pith reviewed 2026-05-15 01:17 UTC · model grok-4.3

classification 💻 cs.AI · cs.LG · cs.NE

keywords machine learning · evolutionary algorithms · neuro-symbolic systems · computational graphs · open-ended evolution · directed acyclic graphs · structural break detection


The pith

EvoForest evolves entire computational graphs using LLM mutations to discover predictive structures that outperform fixed-model weight optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Modern machine learning typically selects a model family and tunes its parameters, but this approach struggles when success requires inventing the right transformations, interactions, or summaries from data rather than fitting weights. EvoForest addresses this by maintaining a shared directed acyclic graph in which nodes hold alternative implementations, callable nodes define reusable families of transformations such as gates and projections, and output nodes represent candidate computations; persistent parameters within the graph can still be refined by gradient descent. Each candidate graph is scored by a lightweight Ridge readout against a cross-validation target, and the resulting structured feedback drives LLM-based mutations that alter the graph topology and node contents. In the 2025 ADIA Lab Structural Break Challenge the system reached 94.13 percent ROC-AUC after 600 steps, exceeding the publicly reported winning score of 90.14 percent under identical evaluation. A reader would care because the method offers an automated route to complex, interpretable computations for objectives that are non-differentiable or require continual structural adaptation.
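To make the scoring step concrete, the following is a minimal sketch of a Ridge readout scored by out-of-fold ROC-AUC. It is not the authors' evaluator: the function names (`ridge_fit`, `roc_auc`, `cv_readout_score`), the fold count, the regularization strength `alpha`, and the synthetic data are all assumptions chosen for illustration, and the paper does not state its actual values.

```python
import numpy as np


def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w


def roc_auc(y_true, scores):
    """Rank-based ROC-AUC (Mann-Whitney U); tied scores share an averaged rank."""
    order = np.argsort(scores, kind="mergesort")
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    for s in np.unique(scores):
        tied = scores == s
        ranks[tied] = ranks[tied].mean()
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def cv_readout_score(features, y, n_folds=5, alpha=1.0, seed=0):
    """Out-of-fold ROC-AUC of a ridge readout fitted on candidate features."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    oof = np.zeros(len(y))
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        w, b = ridge_fit(features[train], y[train], alpha)
        oof[test] = features[test] @ w + b
    return roc_auc(y, oof)


# Synthetic demonstration (assumed data, not from the paper): features that
# carry the label should score well above features that are pure noise.
rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=400)
informative = y[:, None] + rng.normal(scale=1.0, size=(400, 3))
noise = rng.normal(size=(400, 3))
print("informative:", cv_readout_score(informative, y))
print("noise:      ", cv_readout_score(noise, y))
```

Because ROC-AUC depends only on the ranking of the scores, it is not differentiable in the features, so in this sketch the readout serves as a scorer for a candidate representation rather than as a training loss.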

Core claim

EvoForest is a hybrid neuro-symbolic system for end-to-end open-ended evolution of computation. Rather than merely generating features, it jointly evolves reusable computational structure, callable function families, and trainable low-dimensional continuous components inside a shared directed acyclic graph. Intermediate nodes store alternative implementations, callable nodes encode reusable transformation families such as projections, gates, and activations, output nodes define candidate predictive computations, and persistent global parameters can be refined by gradient descent. For each graph configuration, EvoForest evaluates the discovered computation and uses a lightweight Ridge-based readout to score the resulting representation against a non-differentiable cross-validation target.

What carries the argument

A shared directed acyclic graph whose intermediate nodes store alternative implementations, callable nodes encode reusable transformation families, and output nodes represent candidate predictions; the graph is mutated by LLMs guided by structured feedback from a Ridge readout on cross-validation scores.
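The graph structure can be illustrated with a small sketch. The extracted prompt text in the reference graph describes each alternative as a Python lambda and a configuration as selecting one alternative per intermediate node; the toy graph below follows that description. The node names (`center`, `scale`), the `GRAPH` layout, and the helper functions are hypothetical, and output nodes and callable nodes are left out for brevity.

```python
from itertools import product

# Hypothetical toy graph: node name -> (input names, list of alternative lambdas).
GRAPH = {
    "center": (["x"], [
        lambda x: [v - sum(x) / len(x) for v in x],         # subtract the mean
        lambda x: [v - sorted(x)[len(x) // 2] for v in x],  # subtract the upper median
    ]),
    "scale": (["center"], [
        lambda c: max(abs(v) for v in c),                   # peak magnitude
        lambda c: sum(v * v for v in c) / len(c),           # mean square
    ]),
}
ORDER = ["center", "scale"]  # a topological order of the DAG


def evaluate_config(config, x):
    """Evaluate the graph with one alternative index chosen per node."""
    values = {"x": x}
    for name in ORDER:
        inputs, alternatives = GRAPH[name]
        fn = alternatives[config[name]]
        values[name] = fn(*(values[i] for i in inputs))
    return values


def all_configs():
    """Enumerate every configuration (one alternative per intermediate node)."""
    counts = [range(len(GRAPH[name][1])) for name in ORDER]
    return [dict(zip(ORDER, choice)) for choice in product(*counts)]


x = [1.0, 2.0, 6.0]
for config in all_configs():
    print(config, evaluate_config(config, x)["scale"])
```

With two alternatives on each of two nodes, the shared graph encodes four distinct computations, which is the sense in which a single DAG can hold many candidate programs at once.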

Load-bearing premise

LLM-driven mutations guided by Ridge readout feedback on cross-validation scores will reliably discover superior computational structures without excessive overfitting to the specific challenge or evaluator.

What would settle it

Apply EvoForest unchanged to an entirely new prediction task with a different data distribution and a non-differentiable metric; if the evolved graphs fail to exceed strong fixed-architecture baselines by a comparable margin, the central claim is falsified.

Figures

Figures reproduced from arXiv: 2604.19761 by Hassan Sawaf, Kamer Ali Yuksel.

**Figure 1.** EvoForest architecture. EvoForest DAG (left), two-phase evaluation pipeline (center), LLM-guided scientist/engineer mutation loop (right), and asynchronous island model (bottom). [PITH_FULL_IMAGE:figures/full_fig_p012_1.png]

**Figure 2.** Optimization and search-space diagnostics. (a) Best-so-far ROC-AUC. (b) Graph complexity over time. (c) Per-island scores. (d) Configuration count.

**Figure 3.** Structural and systems diagnostics. (a) Score vs. complexity. (b) Structural changes. (c) Acceptance rates. (d) Wall-clock time. [PITH_FULL_IMAGE:figures/full_fig_p013_3.png]

**Figure 4.** Additional diagnostics. (a) Discoveries by island. (b) Best–mean gap. (c) Score statistics. (d) Score variance.

D Evolved Graph Excerpt

    '@globals':
      rp_W: torch.randn(8, 4) * (1.0 / 8.0 ** 0.5)
      rp_b: torch.zeros(4)
      mr_base_w: torch.randn(16, 9) * (1.0 / 9.0 ** 0.5)
      mr_kmask: torch.ones(16, 1, 9)
      mr_dil: torch.tensor([1,1,1,1,2,2,2,2,4,4,4,4,8,8,8,8])
      rff_W: torch.randn(256, 8) * 0.1
      rff_b: t…

Original abstract

Modern machine learning is still largely organized around a single recipe: choose a parameterized model family and optimize its weights. Although highly successful, this paradigm is too narrow for many structured prediction problems, where the main bottleneck is not parameter fitting but discovering what should be computed from the data. Success often depends on identifying the right transformations, statistics, invariances, interaction structures, temporal summaries, gates, or nonlinear compositions, especially when objectives are non-differentiable, evaluation is cross-validation-based, interpretability matters, or continual adaptation is required. We present EvoForest, a hybrid neuro-symbolic system for end-to-end open-ended evolution of computation. Rather than merely generating features, EvoForest jointly evolves reusable computational structure, callable function families, and trainable low-dimensional continuous components inside a shared directed acyclic graph. Intermediate nodes store alternative implementations, callable nodes encode reusable transformation families such as projections, gates, and activations, output nodes define candidate predictive computations, and persistent global parameters can be refined by gradient descent. For each graph configuration, EvoForest evaluates the discovered computation and uses a lightweight Ridge-based readout to score the resulting representation against a non-differentiable cross-validation target. The evaluator also produces structured feedback that guides future LLM-driven mutations. In the 2025 ADIA Lab Structural Break Challenge, EvoForest reached 94.13% ROC-AUC after 600 evolution steps, exceeding the publicly reported winning score of 90.14% under the same evaluation protocol.
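The evaluate-then-mutate cycle described in the abstract can be sketched as a simple accept-if-better loop. This is a deliberately reduced stand-in, not the EvoForest algorithm: a random edit replaces the LLM mutation step, a list of named operations replaces the DAG, negative mean squared error replaces the cross-validated ROC-AUC, and every name below (`OPS`, `evaluate`, `propose_mutation`, `evolve`) is hypothetical.

```python
import random

# Hypothetical library of primitive transformations.
OPS = {
    "square": lambda v: v * v,
    "negate": lambda v: -v,
    "double": lambda v: 2.0 * v,
    "shift": lambda v: v + 1.0,
}


def run_program(program, x):
    """Apply the named operations to x in order."""
    for name in program:
        x = OPS[name](x)
    return x


def evaluate(program, data):
    """Return (score, feedback); a higher score is better."""
    diffs = [run_program(program, x) - target for x, target in data]
    mse = sum(d * d for d in diffs) / len(diffs)
    return -mse, {"mse": mse, "length": len(program)}


def propose_mutation(program, feedback, rng):
    """Random insert, delete, or replace. The feedback argument is ignored by
    this stand-in; an LLM-driven mutator would condition on it."""
    child = list(program)
    choice = rng.choice(["insert", "delete", "replace"])
    if choice == "insert" or not child:
        child.insert(rng.randrange(len(child) + 1), rng.choice(sorted(OPS)))
    elif choice == "delete":
        child.pop(rng.randrange(len(child)))
    else:
        child[rng.randrange(len(child))] = rng.choice(sorted(OPS))
    return child


def evolve(data, steps=200, seed=0):
    """Keep the best program so far and accept a child only if it scores higher."""
    rng = random.Random(seed)
    best = []
    best_score, feedback = evaluate(best, data)
    history = [best_score]
    for _ in range(steps):
        child = propose_mutation(best, feedback, rng)
        score, child_feedback = evaluate(child, data)
        if score > best_score:
            best, best_score, feedback = child, score, child_feedback
        history.append(best_score)
    return best, best_score, history


# Toy target (an assumption for the demo): t = 2 * x^2 + 1.
data = [(x, 2.0 * x * x + 1.0) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
best, best_score, history = evolve(data, steps=200, seed=0)
print(best, best_score)
```

The best-so-far score in `history` can never decrease under this acceptance rule, which mirrors the kind of best-so-far curve listed in Figure 2(a). A greedy loop like this can stall in local optima; the island model and quality-diversity prompts mentioned elsewhere on this page are the paper's own mechanisms and are not modeled here.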

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

EvoForest evolves DAGs of callable nodes with LLM mutations guided by Ridge CV feedback and claims a clear win on the ADIA structural-break challenge, but the abstract gives almost no controls or configuration details.


The paper's core claim is that EvoForest can discover better computational structures than standard methods by running an open-ended evolutionary search over a shared DAG. Nodes hold alternative implementations or reusable callable families, global parameters stay available for gradient tuning, and an LLM proposes mutations based on structured feedback from an evaluator that scores each graph with a lightweight Ridge readout under cross-validation. On the 2025 ADIA Lab challenge it reports 94.13% ROC-AUC after 600 steps, above the publicly reported winning score of 90.14% under the same protocol. That integration of persistent structure, callable reuse, and non-differentiable scoring feedback is the concrete new piece; it is not just another neuroevolution variant or feature generator.

Referee Report

2 major / 2 minor

Summary. The paper introduces EvoForest, a hybrid neuro-symbolic system for open-ended evolution of computational graphs. It evolves reusable DAG structures with callable nodes (encoding transformations, gates, and activations) and persistent global parameters via LLM-driven mutations; each candidate graph is scored by a lightweight Ridge readout on cross-validation performance against a non-differentiable target, with the same readout supplying structured feedback to guide subsequent mutations. The central empirical claim is that after 600 evolution steps on the 2025 ADIA Lab Structural Break Challenge, EvoForest attains 94.13% ROC-AUC, exceeding the publicly reported winning score of 90.14% under identical evaluation.

Significance. If the performance result is shown to be robust, the work would constitute a meaningful step toward paradigms that discover computational structure rather than merely optimizing parameters within a fixed family. The joint evolution of discrete graph topology, reusable function families, and continuous parameters inside a single DAG, together with the use of non-differentiable CV feedback, addresses a recognized limitation of gradient-only methods on structured or interpretable prediction tasks.

major comments (2)

[Abstract / Results] Abstract and Results section: the headline claim of 94.13% ROC-AUC after exactly 600 steps is presented without error bars, variance across independent runs, ablation of the LLM-mutation or Ridge-feedback components, or explicit configuration details (population size, mutation rate, Ridge regularization strength). Because this single scalar is the sole quantitative support for superiority over the 90.14% baseline, the absence of these controls renders the central empirical claim unverifiable from the manuscript.
[Method] Method section (description of the evaluation loop): the Ridge readout is simultaneously used to compute the fitness score that ranks graphs and to generate the structured feedback that conditions the LLM mutations. This creates an explicit dependence between the discovered structures and the fitted readout parameters; no control experiment (e.g., frozen readout, alternative feedback channel, or transfer to a second structural-break task) is reported to demonstrate that the margin is not an artifact of this closed loop.

minor comments (2)

[Method] The notation for persistent global parameters and the distinction between intermediate, callable, and output nodes would be clearer if accompanied by a small schematic diagram in the first methods subsection.
[Abstract] The manuscript does not state the precise public leaderboard protocol (data splits, exact ROC-AUC computation) against which the 90.14% figure is compared; a one-sentence reference or footnote would eliminate ambiguity.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We agree that the current presentation of the empirical results requires additional statistical controls and ablation studies to make the central claims fully verifiable. We address each major comment below and will incorporate the suggested revisions in the next version of the manuscript.


Referee: [Abstract / Results] Abstract and Results section: the headline claim of 94.13% ROC-AUC after exactly 600 steps is presented without error bars, variance across independent runs, ablation of the LLM-mutation or Ridge-feedback components, or explicit configuration details (population size, mutation rate, Ridge regularization strength). Because this single scalar is the sole quantitative support for superiority over the 90.14% baseline, the absence of these controls renders the central empirical claim unverifiable from the manuscript.

Authors: We agree that the single scalar result is insufficient without supporting statistics and controls. In the revised manuscript we will add: (i) mean and standard deviation of ROC-AUC across at least five independent evolutionary runs with different random seeds; (ii) ablation experiments that disable the LLM-driven mutation operator and the Ridge-based feedback channel separately; and (iii) a complete hyperparameter table listing population size, mutation rate, number of evolution steps, and Ridge regularization strength. These additions will allow readers to assess the robustness of the reported improvement over the 90.14% baseline.
Revision: yes
Referee: [Method] Method section (description of the evaluation loop): the Ridge readout is simultaneously used to compute the fitness score that ranks graphs and to generate the structured feedback that conditions the LLM mutations. This creates an explicit dependence between the discovered structures and the fitted readout parameters; no control experiment (e.g., frozen readout, alternative feedback channel, or transfer to a second structural-break task) is reported to demonstrate that the margin is not an artifact of this closed loop.

Authors: The referee correctly identifies a potential closed-loop artifact. We will add two control experiments to the revised manuscript: (1) a frozen-readout variant in which the Ridge parameters are held fixed after an initial fit and only the graph topology and callable nodes continue to evolve, and (2) an alternative-feedback variant that replaces the Ridge-derived structured feedback with a simple scalar fitness signal. We will also report performance when the best evolved graphs are transferred to a second structural-break dataset. These controls will clarify the contribution of the joint evolution versus the specific feedback mechanism.
Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical performance from open-ended search with external challenge metric


The paper presents EvoForest as an algorithmic system that evolves computational graphs via LLM mutations, scores them with a Ridge readout on cross-validation, and reports the resulting ROC-AUC on the ADIA Lab Structural Break Challenge. No first-principles derivation, uniqueness theorem, or mathematical claim is made that reduces to its own inputs by construction. The performance number is an observed outcome of running the described procedure for 600 steps against an externally defined challenge target; the Ridge component is an internal scoring mechanism whose parameters are not renamed as a prediction of the final result. No self-citations appear in the provided text, and the method remains self-contained as a hybrid neuro-symbolic search process without load-bearing circular reductions.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that the described hybrid evolution loop can discover useful structure; the abstract introduces the EvoForest system itself as the main invented entity without independent evidence outside the reported score.

free parameters (2)

number of evolution steps
600 steps are chosen to reach the reported score; the value is not derived from first principles.
Ridge regularization strength
Used in the readout but its specific value is not stated and must be fitted or chosen per graph.

axioms (1)

domain assumption: LLM mutations guided by evaluator feedback produce useful structural changes
Invoked implicitly when describing how future graphs are generated from current scores.

invented entities (1)

EvoForest DAG with callable nodes and persistent global parameters (no independent evidence)
purpose: To jointly evolve reusable computational structure and trainable components
New named architecture introduced to support the open-ended evolution claim; no external falsifiable prediction is given.



Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages

[1]

configuration

EVOFOREST STRUCTURE (CONFIGURATION-BASED PARADIGM): An EvoForest is a DAG of nodes. Each node contains one or more alternatives, each implemented as a Python lambda function. TWO KINDS OF NODES: (A) INTERMEDIATE NODES: alternatives are COMPETING implementations. A "configuration" selects one alternative per intermediate node. (B) OUTPUT NODE ("output"): a...

[2]

Each output alternative is an expert in the ensemble; intermediate and callable nodes define alternative ways those experts are built and combined

ENSEMBLE EVOLUTION OBJECTIVE Build a diverse ensemble of complementary, high-quality predictors: individually strong output features that capture different aspects and combine well without redundancy. Each output alternative is an expert in the ensemble; intermediate and callable nodes define alternative ways those experts are built and combined

[3]

- **Exploit** strong alternatives (high max / mean)

QUALITY-DIVERSITY SEARCH DYNAMICS Treat each alternative as a micro-program with ROLE, GENETIC LINEAGE, PHENOTYPE IMPACT (statistics), and DESIGN PATTERN. - **Exploit** strong alternatives (high max / mean). - **Explore** weak or underrepresented alternatives. - **Preserve diversity** by maintaining multiple distinct strategies. THINK OUTSIDE THE BOX! Pro...
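The exploit/explore guidance in this prompt excerpt is addressed to an LLM, but the same trade-off can be written as a small selection rule. The sketch below is an assumption about how such a rule might look, not something the paper specifies: `pick_alternative`, the `explore_prob` parameter, and the example statistics are all hypothetical.

```python
import random
from statistics import mean


def pick_alternative(stats, rng, explore_prob=0.3):
    """Choose which alternative to build on next.

    stats maps an alternative's name to the list of scores of the
    configurations that used it (an empty list means it was never evaluated).
    """
    scored = {name: s for name, s in stats.items() if s}
    if not scored or rng.random() < explore_prob:
        # Explore: favour the alternatives with the fewest evaluations.
        fewest = min(len(s) for s in stats.values())
        pool = sorted(name for name, s in stats.items() if len(s) == fewest)
        return "explore", rng.choice(pool)
    # Exploit: highest max score, with ties broken by mean score.
    best = max(sorted(scored), key=lambda n: (max(scored[n]), mean(scored[n])))
    return "exploit", best


# Made-up statistics for three alternatives of one node.
stats = {"a": [0.6, 0.7], "b": [0.9, 0.5], "c": [0.8]}
rng = random.Random(0)
print(pick_alternative(stats, rng, explore_prob=0.0))
print(pick_alternative(stats, rng, explore_prob=1.0))
```

Keeping a nonzero `explore_prob` is one simple way to avoid discarding underrepresented alternatives too early, which is the diversity concern the prompt raises.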

[4]

CODE-LEVEL CROSSOVER

[5]

Intra-node crossover: fuse strong alternatives of the same node

[6]

@globals

Cross-node crossover: encapsulate recurring multi-step motifs. 5b. @globals -- PERSISTENT TRAINABLE PARAMETERS The "@globals" node holds learnable tensor parameters that persist across evolution steps. You may ADD new entries but NEVER modify or remove existing entries (append-only). Supply an init expression; the system wraps it in nn.Parameter and tr...
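The append-only rule for "@globals" can be illustrated with a small registry. This is a dependency-free sketch under stated assumptions, not the paper's code: `Param` is a hypothetical stand-in for the `nn.Parameter` wrapping mentioned in the excerpt, `AppendOnlyGlobals` and its methods are invented for illustration, and a zero-argument callable replaces the init expression.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Param:
    """Hypothetical stand-in for a trainable tensor parameter."""
    value: Any
    trainable: bool = True


class AppendOnlyGlobals:
    """Registry whose entries can be added but never redefined or removed."""

    def __init__(self) -> None:
        self._params: Dict[str, Param] = {}

    def add(self, name: str, init_fn: Callable[[], Any]) -> Param:
        """Create a new entry from its initializer; reject existing names."""
        if name in self._params:
            raise KeyError(f"{name!r} already exists; entries are append-only")
        self._params[name] = Param(init_fn())
        return self._params[name]

    def __getitem__(self, name: str) -> Param:
        return self._params[name]

    def __contains__(self, name: str) -> bool:
        return name in self._params

    def names(self) -> List[str]:
        return list(self._params)

    # No __setitem__ or __delitem__ is defined on purpose, so item
    # assignment and deletion on the registry raise TypeError.


globals_node = AppendOnlyGlobals()
globals_node.add("rp_b", lambda: [0.0] * 4)      # mirrors rp_b: torch.zeros(4)
try:
    globals_node.add("rp_b", lambda: [1.0] * 4)  # attempted redefinition
except KeyError as err:
    print("rejected:", err)
print(globals_node.names())
```

In this sketch the set of entries is frozen once added, while the stored `value` remains free to change, which matches the excerpt's description of parameters that persist across evolution steps yet stay learnable.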
