Recognition: 2 theorem links · Lean Theorem
ThetaEvolve: Test-time Learning on Open Problems
Pith reviewed 2026-05-16 13:11 UTC · model grok-4.3
The pith
A small open-source model learns to evolve programs at test time and sets new best-known bounds on open mathematical problems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ThetaEvolve is the first evolving framework that enables a small open-source model, such as DeepSeek-R1-0528-Qwen3-8B, to achieve new best-known bounds on open problems, including circle packing and the first auto-correlation inequality, by scaling in-context learning and reinforcement learning at test time with a large program database, batch sampling, and lazy penalties; the resulting checkpoints also demonstrate evolving capabilities that transfer across tasks.
What carries the argument
The test-time reinforcement learning loop over a shared program database that updates the model to produce improved evolution strategies from its own optimization attempts.
Load-bearing premise
The observed bound improvements and cross-task gains come from the model internalizing evolving strategies through RL updates rather than from extra total compute or the particular sampling and penalty choices alone.
What would settle it
An ablation that performs the same total number of program evaluations using only static sampling without any RL parameter updates would match or exceed the reported bound improvements and task-transfer effects.
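To make the mechanism and the proposed control concrete, here is a minimal sketch of a test-time evolve-and-train loop of the kind the claim describes. Every name in it (the objective interface, llm.sample, llm.rl_update) is a hypothetical stand-in written for this review, not ThetaEvolve's actual API; the inference-only control that would settle the premise is the same loop with the final rl_update call disabled.

```python
# Minimal sketch of a test-time evolve-and-train loop in the spirit of the
# paper's description; every helper (objective.*, llm.sample, llm.rl_update)
# is a hypothetical stand-in, not ThetaEvolve's actual API.
import random

def run_test_time_rl(llm, objective, n_steps=1000, batch_size=8,
                     lazy_penalty=0.1, reward_scale=1.0):
    database = [objective.initial_program()]        # shared program database
    scores = [objective.evaluate(database[0])]      # external objective values

    for _ in range(n_steps):
        # Batch sampling: draw several parent programs for higher throughput.
        parents = random.choices(database, k=batch_size)
        prompts = [objective.build_prompt(p, database) for p in parents]
        children = llm.sample(prompts)               # one candidate per prompt

        rewards = []
        for parent, child in zip(parents, children):
            score = objective.evaluate(child)        # external, verifiable signal
            reward = reward_scale * (score - objective.evaluate(parent))
            if child.strip() == parent.strip():      # lazy penalty: discourage
                reward -= lazy_penalty               # stagnant, copied outputs
            rewards.append(reward)
            database.append(child)
            scores.append(score)

        # RL update: this is the step an inference-only baseline would skip.
        llm.rl_update(prompts, children, rewards)

    return max(scores)
```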
read the original abstract
Recent advances in large language models (LLMs) have enabled breakthroughs in mathematical discovery, exemplified by AlphaEvolve, a closed-source system that evolves programs to improve bounds on open problems. However, it relies on ensembles of frontier LLMs to achieve new bounds and is a pure inference system, so models cannot internalize the evolving strategies. We introduce ThetaEvolve, an open-source framework that simplifies and extends AlphaEvolve to efficiently scale both in-context learning and Reinforcement Learning (RL) at test time, allowing models to continually learn from their experiences in improving open optimization problems. ThetaEvolve features a single LLM, a large program database for enhanced exploration, batch sampling for higher throughput, lazy penalties to discourage stagnant outputs, and optional reward shaping for stable training signals. ThetaEvolve is the first evolving framework that enables a small open-source model, like DeepSeek-R1-0528-Qwen3-8B, to achieve new best-known bounds on open problems (circle packing and the first auto-correlation inequality) mentioned in AlphaEvolve. Moreover, across two models and four open tasks, we find that ThetaEvolve with RL at test time consistently outperforms inference-only baselines, and the model indeed learns evolving capabilities: the RL-trained checkpoints demonstrate faster progress and better final performance on both the trained target task and other unseen tasks. We release our code publicly: https://github.com/ypwang61/ThetaEvolve
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ThetaEvolve, an open-source framework extending AlphaEvolve by combining in-context learning with reinforcement learning at test time. Using a single LLM, a large program database, batch sampling, lazy penalties, and optional reward shaping, it claims that small open-source models (e.g., DeepSeek-R1-0528-Qwen3-8B) can achieve new best-known bounds on open problems such as circle packing and the first auto-correlation inequality. Across two models and four tasks, RL at test time outperforms inference-only baselines, with RL-trained checkpoints showing faster progress and better performance on both trained and unseen tasks, indicating internalization of evolving strategies.
Significance. If the attribution of gains to internalized strategies holds after proper controls, the result would be significant: it would demonstrate that test-time RL enables smaller open-source models to surpass closed-source inference ensembles on mathematical discovery tasks while providing transferable capabilities, with the public code release supporting reproducibility.
major comments (3)
- [§4] §4 (Experiments): the central claim that RL enables internalization of evolving strategies (evidenced by cross-task transfer and faster progress) lacks isolating ablations that match total compute, batch sampling, and program database size between RL and inference-only runs; without these, gains could arise from prolonged exploration rather than learned capabilities.
- [§4.2] §4.2 (Results on open problems): exact baseline compute budgets, number of program evaluations, and statistical significance (e.g., standard errors or p-values over multiple seeds) for the reported bound improvements and outperformance are not provided, weakening support for consistent superiority across the four tasks.
- [§3.3] §3.3 (Reward shaping and lazy penalties): the free parameters (lazy penalty coefficient and reward shaping scale) are introduced without ablation on their sensitivity; if performance depends heavily on these choices, the claim of robust test-time learning requires further controls.
minor comments (2)
- [Figure 2] Figure 2 and Table 1: axis labels and legend entries for RL vs. inference curves are insufficiently detailed regarding the exact number of tokens or evaluations used.
- [§2] §2 (Related work): the discussion of AlphaEvolve could more explicitly contrast the closed-source ensemble setting with the single-model open-source design to clarify novelty.
Simulated Author's Rebuttal
We appreciate the referee's detailed and constructive feedback on our manuscript. We address each major comment point by point below, providing the strongest honest defense based on the current results while committing to revisions where the concerns are valid and addressable.
read point-by-point responses
-
Referee: [§4] §4 (Experiments): the central claim that RL enables internalization of evolving strategies (evidenced by cross-task transfer and faster progress) lacks isolating ablations that match total compute, batch sampling, and program database size between RL and inference-only runs; without these, gains could arise from prolonged exploration rather than learned capabilities.
Authors: We agree that additional ablations precisely matching total compute, batch sampling, and program database size would strengthen the isolation of the internalization effect from prolonged exploration. Our current evidence relies on the observed faster progress and cross-task transfer of RL-trained checkpoints to unseen tasks, which we interpret as indicating learned evolving strategies. However, we acknowledge that without compute-matched controls, alternative explanations cannot be fully ruled out. In the revised manuscript, we will add new experiments that equate total program evaluations and database size across RL and inference-only conditions to better address this concern. revision: yes
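A sketch of what such a compute-matched control could look like, reusing the run_test_time_rl sketch from earlier in this review; llm_factory and the attribute-level disabling of rl_update are illustrative assumptions, not the authors' protocol.

```python
# Hypothetical sketch of the compute-matched control requested above: both
# arms reuse run_test_time_rl (defined earlier in this review), consume the
# same total number of program evaluations, and differ only in whether the
# RL parameter update is applied. llm_factory is an assumed helper that
# returns a fresh copy of the same checkpoint.
def compute_matched_ablation(llm_factory, objective, total_evals=8000, batch_size=8):
    n_steps = total_evals // batch_size
    results = {}
    for arm, do_update in [("rl_at_test_time", True), ("inference_only", False)]:
        llm = llm_factory()
        if not do_update:
            llm.rl_update = lambda *args, **kwargs: None  # keep sampling, skip learning
        results[arm] = run_test_time_rl(llm, objective,
                                        n_steps=n_steps, batch_size=batch_size)
    return results
```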
-
Referee: [§4.2] §4.2 (Results on open problems): exact baseline compute budgets, number of program evaluations, and statistical significance (e.g., standard errors or p-values over multiple seeds) for the reported bound improvements and outperformance are not provided, weakening support for consistent superiority across the four tasks.
Authors: We acknowledge that the original submission did not report exact baseline compute budgets, precise numbers of program evaluations, or statistical measures such as standard errors or p-values. The results were presented as best-known bounds achieved under the described setups, with qualitative comparisons to baselines. In the revised version, we will add explicit tables detailing compute budgets and evaluation counts for each task and baseline. Due to the substantial computational cost of running multiple independent seeds for all four tasks, we will provide standard errors for key experiments where feasible but may not include full p-values across all settings. revision: partial
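The seed-level statistics promised here are standard; the sketch below (with placeholder numbers, not results from the paper) shows the mean and standard error over per-seed best bounds.

```python
# Sketch of seed-level reporting: mean and standard error of the best bound
# over independent seeds. The values in the usage line are placeholders,
# not results from the paper.
import statistics

def mean_and_stderr(best_bound_per_seed):
    n = len(best_bound_per_seed)
    mean = statistics.mean(best_bound_per_seed)
    stderr = statistics.stdev(best_bound_per_seed) / n ** 0.5 if n > 1 else float("nan")
    return mean, stderr

# Example (placeholder values): mean_and_stderr([0.891, 0.894, 0.889])
```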
-
Referee: [§3.3] §3.3 (Reward shaping and lazy penalties): the free parameters (lazy penalty coefficient and reward shaping scale) are introduced without ablation on their sensitivity; if performance depends heavily on these choices, the claim of robust test-time learning requires further controls.
Authors: The lazy penalty coefficient and reward shaping scale were selected via preliminary tuning to ensure training stability and avoid degenerate behaviors during test-time RL. We agree that the absence of sensitivity ablations limits claims of robustness. We will incorporate an ablation study varying these parameters across a range of values in the revised manuscript to demonstrate that performance remains consistent within reasonable ranges. revision: yes
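An ablation of the kind promised could be a small grid sweep over the two free parameters; the parameter names and ranges below are illustrative assumptions, and the sketch again reuses run_test_time_rl from earlier in this review.

```python
# Illustrative sensitivity sweep over the lazy penalty coefficient and the
# reward shaping scale; ranges are assumptions made for this review.
from itertools import product

def sensitivity_sweep(llm_factory, objective, n_steps=500, batch_size=8):
    lazy_penalties = [0.0, 0.05, 0.1, 0.2]
    reward_scales = [0.5, 1.0, 2.0]
    results = {}
    for lazy_penalty, reward_scale in product(lazy_penalties, reward_scales):
        results[(lazy_penalty, reward_scale)] = run_test_time_rl(
            llm_factory(), objective, n_steps=n_steps, batch_size=batch_size,
            lazy_penalty=lazy_penalty, reward_scale=reward_scale)
    return results  # a robust method should vary little across this grid
```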
Circularity Check
No circularity: empirical framework with external benchmarks
full rationale
The paper presents ThetaEvolve as an empirical system for test-time RL on open optimization problems, with claims supported by direct performance comparisons against published bounds (e.g., circle packing, autocorrelation inequality) and inference-only baselines. No derivation chain, equations, or first-principles results are offered; improvements are measured via observed bounds and cross-task transfer on held-out tasks. No parameter is fitted to a subset and re-used as a 'prediction,' no self-citation supplies a uniqueness theorem or ansatz, and no quantity is defined in terms of itself. The framework is released with code, making results externally verifiable rather than internally forced.
Axiom & Free-Parameter Ledger
free parameters (2)
- lazy penalty coefficient
- reward shaping scale
axioms (1)
- domain assumption: LLMs can generate syntactically valid programs whose quality can be evaluated by an external objective function
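As an illustration of this assumption, the circle-packing task mentioned above admits an external objective that any generated program can be scored against. The verifier below is a sketch written for this review (sum of radii for circles in the unit square, with feasibility checks), not the paper's actual evaluator.

```python
# Illustrative external objective for the circle-packing task: maximize the
# total radius of non-overlapping circles inside the unit square. This is a
# sketch written for this review, not the paper's evaluator.
def packing_score(circles, tol=1e-9):
    """circles: list of (x, y, r) triples proposed by a generated program."""
    for i, (x, y, r) in enumerate(circles):
        if r <= 0:
            return float("-inf")
        # Containment: the circle must stay inside the unit square.
        if x - r < -tol or x + r > 1 + tol or y - r < -tol or y + r > 1 + tol:
            return float("-inf")
        # Non-overlap against every other circle.
        for (x2, y2, r2) in circles[i + 1:]:
            if (x - x2) ** 2 + (y - y2) ** 2 < (r + r2) ** 2 - tol:
                return float("-inf")
    return sum(r for _, _, r in circles)  # objective: total radius
```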
Forward citations
Cited by 19 Pith papers
-
LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning task...
-
Test-Time Learning with an Evolving Library
EvoLib enables LLMs to accumulate, reuse, and evolve knowledge abstractions from inference trajectories at test time, yielding substantial gains on math reasoning, code generation, and agentic benchmarks without param...
-
Evolutionary Ensemble of Agents
EvE uses co-evolving populations of solvers and guidance states with Elo-based evaluation to autonomously discover a rescale-then-interpolate mechanism for better generalization in In-Context Operator Networks.
-
LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
AutoTTS discovers superior test-time scaling strategies for LLMs via cheap controller synthesis in a pre-collected trajectory environment, outperforming manual baselines on math benchmarks with low discovery cost.
-
Agentic-imodels: Evolving agentic interpretability tools via autoresearch
Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.
-
$k$-server-bench: Automating Potential Discovery for the $k$-Server Conjecture
k-server-bench formulates potential-function discovery for the k-server conjecture as a code-based inequality-satisfaction task; current agents fully solve the resolved k=3 case and reduce violations on the open k=4 case.
-
Learning to Discover at Test Time
TTT-Discover applies test-time RL to set new state-of-the-art results on math inequalities, GPU kernels, algorithm contests, and single-cell denoising using an open model and public code.
-
MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI
MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods, with engineering tuning easier than genuine invention.
-
Agentic Architect: An Agentic AI Framework for Architecture Design Exploration and Optimization
An LLM-driven agentic system evolves microarchitectural policies for cache replacement, data prefetching, and branch prediction, producing designs that match or exceed prior state-of-the-art in IPC on standard benchmarks.
-
Evaluation-driven Scaling for Scientific Discovery
SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster ...
-
TEMPO: Scaling Test-time Training for Large Reasoning Models
TEMPO scales test-time training for large reasoning models by interleaving policy refinement on unlabeled data with critic recalibration on labeled data via an EM formulation, yielding large gains on AIME tasks.
-
Co-evolving Agent Architectures and Interpretable Reasoning for Automated Optimization
EvoOR-Agent co-evolves agent architectures as AOE-style networks with graph-mediated recombination and knowledge-base-assisted mutation to outperform fixed LLM pipelines on OR benchmarks.
-
TurboEvolve: Towards Fast and Robust LLM-Driven Program Evolution
TurboEvolve improves LLM program evolution by running parallel islands with LLM-generated diverse candidates that carry self-assigned weights, an adaptive scheduler, and clustered seed injection to reach stronger solu...
-
AI-Driven Research for Databases
Co-evolving LLM-generated solutions with their evaluators enables discovery of novel database algorithms that outperform state-of-the-art baselines, including a query rewrite policy with up to 6.8x lower latency.
-
Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace
Shepherd is a runtime system that formalizes meta-agent operations via typed execution traces, enabling fast forking and demonstrated improvements in agent intervention, optimization, and training on benchmarks.
-
Evolutionary Ensemble of Agents
EvE co-evolves code solvers and guidance states via synchronous races and Elo updates, discovering a rescale-then-interpolate mechanism that enables example-count generalization in ICON.
-
PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents
PACEvolve++ uses a phase-adaptive reinforcement learning advisor to decouple hypothesis selection from execution in LLM-driven evolutionary search, delivering faster convergence than prior frameworks on load balancing...
-
Grokability in five inequalities
Five improved inequalities were found with AI help: better Gaussian perimeter bounds for convex sets, sharper L2-L1 moments on the Hamming cube, a strengthened autoconvolution inequality, improved g-Sidon set bounds, ...
-
Training-Free Test-Time Contrastive Learning for Large Language Models
TF-TTCL lets frozen LLMs adapt online by distilling textual rules from contrastive reasoning trajectories generated via multi-agent augmentation and applying them through retrieval-based steering.