Recognition: 2 theorem links · Lean Theorem
ThetaEvolve: Test-time Learning on Open Problems
Pith reviewed 2026-05-16 13:11 UTC · model grok-4.3
The pith
A small open-source model learns to evolve programs at test time and sets new best-known bounds on open mathematical problems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ThetaEvolve is the first evolving framework that enables a small open-source model, such as DeepSeek-R1-0528-Qwen3-8B, to achieve new best-known bounds on open problems, including circle packing and the first auto-correlation inequality, by scaling in-context learning and reinforcement learning at test time with a large program database, batch sampling, and lazy penalties; the resulting checkpoints also demonstrate evolving capabilities that transfer across tasks.
What carries the argument
The test-time reinforcement learning loop over a shared program database that updates the model to produce improved evolution strategies from its own optimization attempts.
Load-bearing premise
The observed bound improvements and cross-task gains come from the model internalizing evolving strategies through RL updates rather than from extra total compute or the particular sampling and penalty choices alone.
What would settle it
An ablation that performs the same total number of program evaluations using only static sampling without any RL parameter updates would match or exceed the reported bound improvements and task-transfer effects.
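To make the mechanism and the proposed control concrete, here is a minimal sketch of a test-time evolve-and-train loop of the kind the claim describes. Every name in it (the objective interface, llm.sample, llm.rl_update) is a hypothetical stand-in written for this review, not ThetaEvolve's actual API; the inference-only control that would settle the premise is the same loop with the final rl_update call disabled.

```python
# Minimal sketch of a test-time evolve-and-train loop in the spirit of the
# paper's description; every helper (objective.*, llm.sample, llm.rl_update)
# is a hypothetical stand-in, not ThetaEvolve's actual API.
import random

def run_test_time_rl(llm, objective, n_steps=1000, batch_size=8,
                     lazy_penalty=0.1, reward_scale=1.0):
    database = [objective.initial_program()]        # shared program database
    scores = [objective.evaluate(database[0])]      # external objective values

    for _ in range(n_steps):
        # Batch sampling: draw several parent programs for higher throughput.
        parents = random.choices(database, k=batch_size)
        prompts = [objective.build_prompt(p, database) for p in parents]
        children = llm.sample(prompts)               # one candidate per prompt

        rewards = []
        for parent, child in zip(parents, children):
            score = objective.evaluate(child)        # external, verifiable signal
            reward = reward_scale * (score - objective.evaluate(parent))
            if child.strip() == parent.strip():      # lazy penalty: discourage
                reward -= lazy_penalty               # stagnant, copied outputs
            rewards.append(reward)
            database.append(child)
            scores.append(score)

        # RL update: this is the step an inference-only baseline would skip.
        llm.rl_update(prompts, children, rewards)

    return max(scores)
```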
read the original abstract
Recent advances in large language models (LLMs) have enabled breakthroughs in mathematical discovery, exemplified by AlphaEvolve, a closed-source system that evolves programs to improve bounds on open problems. However, it relies on ensembles of frontier LLMs to achieve new bounds and is a pure inference system, so models cannot internalize the evolving strategies. We introduce ThetaEvolve, an open-source framework that simplifies and extends AlphaEvolve to efficiently scale both in-context learning and Reinforcement Learning (RL) at test time, allowing models to continually learn from their experiences in improving open optimization problems. ThetaEvolve features a single LLM, a large program database for enhanced exploration, batch sampling for higher throughput, lazy penalties to discourage stagnant outputs, and optional reward shaping for stable training signals. ThetaEvolve is the first evolving framework that enables a small open-source model, like DeepSeek-R1-0528-Qwen3-8B, to achieve new best-known bounds on open problems (circle packing and the first auto-correlation inequality) mentioned in AlphaEvolve. Moreover, across two models and four open tasks, we find that ThetaEvolve with RL at test time consistently outperforms inference-only baselines, and the model indeed learns evolving capabilities: the RL-trained checkpoints demonstrate faster progress and better final performance on both the trained target task and other unseen tasks. We release our code publicly: https://github.com/ypwang61/ThetaEvolve
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ThetaEvolve, an open-source framework extending AlphaEvolve by combining in-context learning with reinforcement learning at test time. Using a single LLM, a large program database, batch sampling, lazy penalties, and optional reward shaping, it claims that small open-source models (e.g., DeepSeek-R1-0528-Qwen3-8B) can achieve new best-known bounds on open problems such as circle packing and the first auto-correlation inequality. Across two models and four tasks, RL at test time outperforms inference-only baselines, with RL-trained checkpoints showing faster progress and better performance on both trained and unseen tasks, indicating internalization of evolving strategies.
Significance. If the attribution of gains to internalized strategies holds after proper controls, the result would be significant: it would demonstrate that test-time RL enables smaller open-source models to surpass closed-source inference ensembles on mathematical discovery tasks while providing transferable capabilities, with the public code release supporting reproducibility.
major comments (3)
- [§4] §4 (Experiments): the central claim that RL enables internalization of evolving strategies (evidenced by cross-task transfer and faster progress) lacks isolating ablations that match total compute, batch sampling, and program database size between RL and inference-only runs; without these, gains could arise from prolonged exploration rather than learned capabilities.
- [§4.2] §4.2 (Results on open problems): exact baseline compute budgets, number of program evaluations, and statistical significance (e.g., standard errors or p-values over multiple seeds) for the reported bound improvements and outperformance are not provided, weakening support for consistent superiority across the four tasks.
- [§3.3] §3.3 (Reward shaping and lazy penalties): the free parameters (lazy penalty coefficient and reward shaping scale) are introduced without ablation on their sensitivity; if performance depends heavily on these choices, the claim of robust test-time learning requires further controls.
minor comments (2)
- [Figure 2] Figure 2 and Table 1: axis labels and legend entries for RL vs. inference curves are insufficiently detailed regarding the exact number of tokens or evaluations used.
- [§2] §2 (Related work): the discussion of AlphaEvolve could more explicitly contrast the closed-source ensemble setting with the single-model open-source design to clarify novelty.
Simulated Author's Rebuttal
We appreciate the referee's detailed and constructive feedback on our manuscript. We address each major comment point by point below, providing the strongest honest defense based on the current results while committing to revisions where the concerns are valid and addressable.
read point-by-point responses
-
Referee: [§4] §4 (Experiments): the central claim that RL enables internalization of evolving strategies (evidenced by cross-task transfer and faster progress) lacks isolating ablations that match total compute, batch sampling, and program database size between RL and inference-only runs; without these, gains could arise from prolonged exploration rather than learned capabilities.
Authors: We agree that additional ablations precisely matching total compute, batch sampling, and program database size would strengthen the isolation of the internalization effect from prolonged exploration. Our current evidence relies on the observed faster progress and cross-task transfer of RL-trained checkpoints to unseen tasks, which we interpret as indicating learned evolving strategies. However, we acknowledge that without compute-matched controls, alternative explanations cannot be fully ruled out. In the revised manuscript, we will add new experiments that equate total program evaluations and database size across RL and inference-only conditions to better address this concern. revision: yes
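A sketch of what such a compute-matched control could look like, reusing the run_test_time_rl sketch from earlier in this review; llm_factory and the attribute-level disabling of rl_update are illustrative assumptions, not the authors' protocol.

```python
# Hypothetical sketch of the compute-matched control requested above: both
# arms reuse run_test_time_rl (defined earlier in this review), consume the
# same total number of program evaluations, and differ only in whether the
# RL parameter update is applied. llm_factory is an assumed helper that
# returns a fresh copy of the same checkpoint.
def compute_matched_ablation(llm_factory, objective, total_evals=8000, batch_size=8):
    n_steps = total_evals // batch_size
    results = {}
    for arm, do_update in [("rl_at_test_time", True), ("inference_only", False)]:
        llm = llm_factory()
        if not do_update:
            llm.rl_update = lambda *args, **kwargs: None  # keep sampling, skip learning
        results[arm] = run_test_time_rl(llm, objective,
                                        n_steps=n_steps, batch_size=batch_size)
    return results
```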
-
Referee: [§4.2] §4.2 (Results on open problems): exact baseline compute budgets, number of program evaluations, and statistical significance (e.g., standard errors or p-values over multiple seeds) for the reported bound improvements and outperformance are not provided, weakening support for consistent superiority across the four tasks.
Authors: We acknowledge that the original submission did not report exact baseline compute budgets, precise numbers of program evaluations, or statistical measures such as standard errors or p-values. The results were presented as best-known bounds achieved under the described setups, with qualitative comparisons to baselines. In the revised version, we will add explicit tables detailing compute budgets and evaluation counts for each task and baseline. Due to the substantial computational cost of running multiple independent seeds for all four tasks, we will provide standard errors for key experiments where feasible but may not include full p-values across all settings. revision: partial
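The seed-level statistics promised here are standard; the sketch below (with placeholder numbers, not results from the paper) shows the mean and standard error over per-seed best bounds.

```python
# Sketch of seed-level reporting: mean and standard error of the best bound
# over independent seeds. The values in the usage line are placeholders,
# not results from the paper.
import statistics

def mean_and_stderr(best_bound_per_seed):
    n = len(best_bound_per_seed)
    mean = statistics.mean(best_bound_per_seed)
    stderr = statistics.stdev(best_bound_per_seed) / n ** 0.5 if n > 1 else float("nan")
    return mean, stderr

# Example (placeholder values): mean_and_stderr([0.891, 0.894, 0.889])
```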
-
Referee: [§3.3] §3.3 (Reward shaping and lazy penalties): the free parameters (lazy penalty coefficient and reward shaping scale) are introduced without ablation on their sensitivity; if performance depends heavily on these choices, the claim of robust test-time learning requires further controls.
Authors: The lazy penalty coefficient and reward shaping scale were selected via preliminary tuning to ensure training stability and avoid degenerate behaviors during test-time RL. We agree that the absence of sensitivity ablations limits claims of robustness. We will incorporate an ablation study varying these parameters across a range of values in the revised manuscript to demonstrate that performance remains consistent within reasonable ranges. revision: yes
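An ablation of the kind promised could be a small grid sweep over the two free parameters; the parameter names and ranges below are illustrative assumptions, and the sketch again reuses run_test_time_rl from earlier in this review.

```python
# Illustrative sensitivity sweep over the lazy penalty coefficient and the
# reward shaping scale; ranges are assumptions made for this review.
from itertools import product

def sensitivity_sweep(llm_factory, objective, n_steps=500, batch_size=8):
    lazy_penalties = [0.0, 0.05, 0.1, 0.2]
    reward_scales = [0.5, 1.0, 2.0]
    results = {}
    for lazy_penalty, reward_scale in product(lazy_penalties, reward_scales):
        results[(lazy_penalty, reward_scale)] = run_test_time_rl(
            llm_factory(), objective, n_steps=n_steps, batch_size=batch_size,
            lazy_penalty=lazy_penalty, reward_scale=reward_scale)
    return results  # a robust method should vary little across this grid
```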
Circularity Check
No circularity: empirical framework with external benchmarks
full rationale
The paper presents ThetaEvolve as an empirical system for test-time RL on open optimization problems, with claims supported by direct performance comparisons against published bounds (e.g., circle packing, autocorrelation inequality) and inference-only baselines. No derivation chain, equations, or first-principles results are offered; improvements are measured via observed bounds and cross-task transfer on held-out tasks. No parameter is fitted to a subset and re-used as a 'prediction,' no self-citation supplies a uniqueness theorem or ansatz, and no quantity is defined in terms of itself. The framework is released with code, making results externally verifiable rather than internally forced.
Axiom & Free-Parameter Ledger
free parameters (2)
- lazy penalty coefficient
- reward shaping scale
axioms (1)
- domain assumption: LLMs can generate syntactically valid programs whose quality can be evaluated by an external objective function
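As an illustration of this assumption, the circle-packing task mentioned above admits an external objective that any generated program can be scored against. The verifier below is a sketch written for this review (sum of radii for circles in the unit square, with feasibility checks), not the paper's actual evaluator.

```python
# Illustrative external objective for the circle-packing task: maximize the
# total radius of non-overlapping circles inside the unit square. This is a
# sketch written for this review, not the paper's evaluator.
def packing_score(circles, tol=1e-9):
    """circles: list of (x, y, r) triples proposed by a generated program."""
    for i, (x, y, r) in enumerate(circles):
        if r <= 0:
            return float("-inf")
        # Containment: the circle must stay inside the unit square.
        if x - r < -tol or x + r > 1 + tol or y - r < -tol or y + r > 1 + tol:
            return float("-inf")
        # Non-overlap against every other circle.
        for (x2, y2, r2) in circles[i + 1:]:
            if (x - x2) ** 2 + (y - y2) ** 2 < (r + r2) ** 2 - tol:
                return float("-inf")
    return sum(r for _, _, r in circles)  # objective: total radius
```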
Forward citations
Cited by 19 Pith papers
-
LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning task...
-
Test-Time Learning with an Evolving Library
EvoLib enables LLMs to accumulate, reuse, and evolve knowledge abstractions from inference trajectories at test time, yielding substantial gains on math reasoning, code generation, and agentic benchmarks without param...
-
Evolutionary Ensemble of Agents
EvE uses co-evolving populations of solvers and guidance states with Elo-based evaluation to autonomously discover a rescale-then-interpolate mechanism for better generalization in In-Context Operator Networks.
-
LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
AutoTTS discovers superior test-time scaling strategies for LLMs via cheap controller synthesis in a pre-collected trajectory environment, outperforming manual baselines on math benchmarks with low discovery cost.
-
Agentic-imodels: Evolving agentic interpretability tools via autoresearch
Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.
-
$k$-server-bench: Automating Potential Discovery for the $k$-Server Conjecture
k-server-bench formulates potential-function discovery for the k-server conjecture as a code-based inequality-satisfaction task; current agents fully solve the resolved k=3 case and reduce violations on the open k=4 case.
-
Learning to Discover at Test Time
TTT-Discover applies test-time RL to set new state-of-the-art results on math inequalities, GPU kernels, algorithm contests, and single-cell denoising using an open model and public code.
-
MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI
MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods, with engineering tuning easier than genuine invention.
-
Agentic Architect: An Agentic AI Framework for Architecture Design Exploration and Optimization
An LLM-driven agentic system evolves microarchitectural policies for cache replacement, data prefetching, and branch prediction, producing designs that match or exceed prior state-of-the-art in IPC on standard benchmarks.
-
Evaluation-driven Scaling for Scientific Discovery
SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster ...
-
TEMPO: Scaling Test-time Training for Large Reasoning Models
TEMPO scales test-time training for large reasoning models by interleaving policy refinement on unlabeled data with critic recalibration on labeled data via an EM formulation, yielding large gains on AIME tasks.
-
Co-evolving Agent Architectures and Interpretable Reasoning for Automated Optimization
EvoOR-Agent co-evolves agent architectures as AOE-style networks with graph-mediated recombination and knowledge-base-assisted mutation to outperform fixed LLM pipelines on OR benchmarks.
-
TurboEvolve: Towards Fast and Robust LLM-Driven Program Evolution
TurboEvolve improves LLM program evolution by running parallel islands with LLM-generated diverse candidates that carry self-assigned weights, an adaptive scheduler, and clustered seed injection to reach stronger solu...
-
AI-Driven Research for Databases
Co-evolving LLM-generated solutions with their evaluators enables discovery of novel database algorithms that outperform state-of-the-art baselines, including a query rewrite policy with up to 6.8x lower latency.
-
Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace
Shepherd is a runtime system that formalizes meta-agent operations via typed execution traces, enabling fast forking and demonstrated improvements in agent intervention, optimization, and training on benchmarks.
-
Evolutionary Ensemble of Agents
EvE co-evolves code solvers and guidance states via synchronous races and Elo updates, discovering a rescale-then-interpolate mechanism that enables example-count generalization in ICON.
-
PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents
PACEvolve++ uses a phase-adaptive reinforcement learning advisor to decouple hypothesis selection from execution in LLM-driven evolutionary search, delivering faster convergence than prior frameworks on load balancing...
-
Grokability in five inequalities
Five improved inequalities were found with AI help: better Gaussian perimeter bounds for convex sets, sharper L2-L1 moments on the Hamming cube, a strengthened autoconvolution inequality, improved g-Sidon set bounds, ...
-
Training-Free Test-Time Contrastive Learning for Large Language Models
TF-TTCL lets frozen LLMs adapt online by distilling textual rules from contrastive reasoning trajectories generated via multi-agent augmentation and applying them through retrieval-based steering.