pith. machine review for the scientific record.

arxiv: 2511.23473 · v1 · submitted 2025-11-28 · 💻 cs.LG · cs.CL

Recognition: 2 theorem links

· Lean Theorem

ThetaEvolve: Test-time Learning on Open Problems

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 13:11 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords test-time learning · program evolution · reinforcement learning · open optimization problems · circle packing · auto-correlation inequality · open-source LLMs · mathematical discovery

The pith

A small open-source model learns to evolve programs at test time and sets new best-known bounds on open mathematical problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

ThetaEvolve lets a single small language model improve its own ability to search for better programs on unsolved optimization tasks by running reinforcement learning on its own attempts during inference. The framework adds a large program database, batch sampling, and lazy penalties to standard evolution methods so the model can internalize more effective search strategies instead of treating each problem in isolation. Experiments show the trained model reaches new record bounds on circle packing and the first autocorrelation inequality while also transferring faster progress to other unseen tasks. This turns one-time inference into ongoing learning that works with modest open models rather than requiring ensembles of frontier systems.

Core claim

ThetaEvolve is the first evolving framework that enables a small open-source model, such as DeepSeek-R1-0528-Qwen3-8B, to achieve new best-known bounds on open problems, including circle packing and the first auto-correlation inequality. It does so by scaling in-context learning and reinforcement learning at test time with a large program database, batch sampling, and lazy penalties, and the resulting checkpoints learn evolving capabilities that transfer across tasks.

What carries the argument

The test-time reinforcement learning loop over a shared program database that updates the model to produce improved evolution strategies from its own optimization attempts.
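The loop described here (sample parents from a shared program database, propose edits with the model, score the candidates, reward improvement) can be sketched minimally. All names below are illustrative rather than the paper's API; `propose` stands in for the LLM and `evaluate` for the external objective:

```python
import random

def evolve_step(database, propose, evaluate, batch_size=4, lazy_penalty=0.1):
    """One sketched evolution step over a shared program database.

    `database` is a list of (program, score) pairs. The returned rewards
    are what a test-time RL update would consume; this is a reading of the
    review's description, not the paper's implementation.
    """
    parents = random.sample(database, min(batch_size, len(database)))
    rewards = []
    for parent_prog, parent_score in parents:
        child = propose(parent_prog)
        child_score = evaluate(child)
        # Reward improvement over the parent program.
        reward = child_score - parent_score
        # Lazy penalty: a stagnant output (child identical to its parent)
        # is charged a fixed cost so the policy is pushed to actually edit.
        if child == parent_prog:
            reward -= lazy_penalty
        database.append((child, child_score))
        rewards.append(reward)
    return rewards
```

Because every candidate is appended to the database, later steps can branch from any earlier program, which is the "shared database" part of the argument.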

Load-bearing premise

The observed bound improvements and cross-task gains come from the model internalizing evolving strategies through RL updates rather than from extra total compute or the particular sampling and penalty choices alone.

What would settle it

An ablation that matches the total number of program evaluations but uses only static sampling, with no RL parameter updates. If that condition matched or exceeded the reported bound improvements and task-transfer effects, the internalization claim would fall; if RL still won at equal budget, it would stand.
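A compute-matched harness for that test could look like the following sketch. The step functions and budget accounting are hypothetical; the point is only that both conditions must draw on the same evaluation budget:

```python
def run_condition(step_fn, database, budget):
    """Run one search condition under a fixed program-evaluation budget.

    `step_fn(database)` performs one step and returns
    (evaluations_used, best_score_seen_this_step). Passing both the RL
    condition and the static-sampling condition through the same harness
    equates total evaluations, which is what the ablation requires.
    """
    evals = 0
    best = max(score for _, score in database)
    while evals < budget:
        used, step_best = step_fn(database)
        evals += used
        best = max(best, step_best)
    return best, evals
```

Reporting `best` at equal `evals` for both conditions is the comparison that would separate learned strategies from mere extra search.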

read the original abstract

Recent advances in large language models (LLMs) have enabled breakthroughs in mathematical discovery, exemplified by AlphaEvolve, a closed-source system that evolves programs to improve bounds on open problems. However, it relies on ensembles of frontier LLMs to achieve new bounds, and as a pure inference system its evolving strategies cannot be internalized by the models. We introduce ThetaEvolve, an open-source framework that simplifies and extends AlphaEvolve to efficiently scale both in-context learning and Reinforcement Learning (RL) at test time, allowing models to continually learn from their experiences in improving open optimization problems. ThetaEvolve features a single LLM, a large program database for enhanced exploration, batch sampling for higher throughput, lazy penalties to discourage stagnant outputs, and optional reward shaping for stable training signals. ThetaEvolve is the first evolving framework that enables a small open-source model, like DeepSeek-R1-0528-Qwen3-8B, to achieve new best-known bounds on open problems (circle packing and the first auto-correlation inequality) mentioned in AlphaEvolve. Moreover, across two models and four open tasks, we find that ThetaEvolve with RL at test time consistently outperforms inference-only baselines, and the model indeed learns evolving capabilities, as the RL-trained checkpoints demonstrate faster progress and better final performance on both the trained target task and other unseen tasks. We release our code publicly: https://github.com/ypwang61/ThetaEvolve

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces ThetaEvolve, an open-source framework extending AlphaEvolve by combining in-context learning with reinforcement learning at test time. Using a single LLM, a large program database, batch sampling, lazy penalties, and optional reward shaping, it claims that small open-source models (e.g., DeepSeek-R1-0528-Qwen3-8B) can achieve new best-known bounds on open problems such as circle packing and the first auto-correlation inequality. Across two models and four tasks, RL at test time outperforms inference-only baselines, with RL-trained checkpoints showing faster progress and better performance on both trained and unseen tasks, indicating internalization of evolving strategies.

Significance. If the attribution of gains to internalized strategies holds after proper controls, the result would be significant: it would demonstrate that test-time RL enables smaller open-source models to surpass closed-source inference ensembles on mathematical discovery tasks while providing transferable capabilities, with the public code release supporting reproducibility.

major comments (3)
  1. [§4] §4 (Experiments): the central claim that RL enables internalization of evolving strategies (evidenced by cross-task transfer and faster progress) lacks isolating ablations that match total compute, batch sampling, and program database size between RL and inference-only runs; without these, gains could arise from prolonged exploration rather than learned capabilities.
  2. [§4.2] §4.2 (Results on open problems): exact baseline compute budgets, number of program evaluations, and statistical significance (e.g., standard errors or p-values over multiple seeds) for the reported bound improvements and outperformance are not provided, weakening support for consistent superiority across the four tasks.
  3. [§3.3] §3.3 (Reward shaping and lazy penalties): the free parameters (lazy penalty coefficient and reward shaping scale) are introduced without ablation on their sensitivity; if performance depends heavily on these choices, the claim of robust test-time learning requires further controls.
minor comments (2)
  1. [Figure 2] Figure 2 and Table 1: axis labels and legend entries for RL vs. inference curves are insufficiently detailed regarding the exact number of tokens or evaluations used.
  2. [§2] §2 (Related work): the discussion of AlphaEvolve could more explicitly contrast the closed-source ensemble setting with the single-model open-source design to clarify novelty.
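The statistics asked for in major comment 2 are cheap to compute once multi-seed runs exist. A minimal sketch (helper name is ours): the standard error of the mean over seeds, reported per task alongside each bound.

```python
import statistics

def seed_summary(final_scores):
    """Mean and standard error of the mean over independent seeds.

    `final_scores` holds one best-found bound per seed; this is the
    per-task statistic the referee asks to see next to each reported
    improvement.
    """
    n = len(final_scores)
    mean = statistics.fmean(final_scores)
    # SEM is undefined for a single seed; flag it rather than report 0.
    sem = statistics.stdev(final_scores) / n ** 0.5 if n > 1 else float("nan")
    return mean, sem
```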

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We appreciate the referee's detailed and constructive feedback on our manuscript. We address each major comment point by point below, providing the strongest honest defense based on the current results while committing to revisions where the concerns are valid and addressable.

read point-by-point responses
  1. Referee: [§4] §4 (Experiments): the central claim that RL enables internalization of evolving strategies (evidenced by cross-task transfer and faster progress) lacks isolating ablations that match total compute, batch sampling, and program database size between RL and inference-only runs; without these, gains could arise from prolonged exploration rather than learned capabilities.

    Authors: We agree that additional ablations precisely matching total compute, batch sampling, and program database size would strengthen the isolation of the internalization effect from prolonged exploration. Our current evidence relies on the observed faster progress and cross-task transfer of RL-trained checkpoints to unseen tasks, which we interpret as indicating learned evolving strategies. However, we acknowledge that without compute-matched controls, alternative explanations cannot be fully ruled out. In the revised manuscript, we will add new experiments that equate total program evaluations and database size across RL and inference-only conditions to better address this concern. revision: yes

  2. Referee: [§4.2] §4.2 (Results on open problems): exact baseline compute budgets, number of program evaluations, and statistical significance (e.g., standard errors or p-values over multiple seeds) for the reported bound improvements and outperformance are not provided, weakening support for consistent superiority across the four tasks.

    Authors: We acknowledge that the original submission did not report exact baseline compute budgets, precise numbers of program evaluations, or statistical measures such as standard errors or p-values. The results were presented as best-known bounds achieved under the described setups, with qualitative comparisons to baselines. In the revised version, we will add explicit tables detailing compute budgets and evaluation counts for each task and baseline. Due to the substantial computational cost of running multiple independent seeds for all four tasks, we will provide standard errors for key experiments where feasible but may not include full p-values across all settings. revision: partial

  3. Referee: [§3.3] §3.3 (Reward shaping and lazy penalties): the free parameters (lazy penalty coefficient and reward shaping scale) are introduced without ablation on their sensitivity; if performance depends heavily on these choices, the claim of robust test-time learning requires further controls.

    Authors: The lazy penalty coefficient and reward shaping scale were selected via preliminary tuning to ensure training stability and avoid degenerate behaviors during test-time RL. We agree that the absence of sensitivity ablations limits claims of robustness. We will incorporate an ablation study varying these parameters across a range of values in the revised manuscript to demonstrate that performance remains consistent within reasonable ranges. revision: yes
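The ablation the authors commit to in response 3 is a plain grid sweep. A hypothetical harness (parameter names invented for illustration, not taken from the paper's code):

```python
def sensitivity_sweep(run_experiment, lazy_coeffs, shaping_scales):
    """Grid sweep over the lazy-penalty coefficient and reward-shaping scale.

    `run_experiment` is assumed to launch one full test-time RL run and
    return its final score; robustness, in the referee's sense, means the
    returned scores vary little across reasonable settings.
    """
    return {
        (c, s): run_experiment(lazy_coeff=c, shaping_scale=s)
        for c in lazy_coeffs
        for s in shaping_scales
    }
```

Plotting the resulting dictionary as a heatmap over (coefficient, scale) would make the claimed insensitivity directly visible.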

Circularity Check

0 steps flagged

No circularity: empirical framework with external benchmarks

full rationale

The paper presents ThetaEvolve as an empirical system for test-time RL on open optimization problems, with claims supported by direct performance comparisons against published bounds (e.g., circle packing, autocorrelation inequality) and inference-only baselines. No derivation chain, equations, or first-principles results are offered; improvements are measured via observed bounds and cross-task transfer on held-out tasks. No parameter is fitted to a subset and re-used as a 'prediction,' no self-citation supplies a uniqueness theorem or ansatz, and no quantity is defined in terms of itself. The framework is released with code, making results externally verifiable rather than internally forced.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The framework relies on standard LLM program-generation assumptions and introduces several tunable elements whose values are chosen to produce the reported performance.

free parameters (2)
  • lazy penalty coefficient
    Hyperparameter introduced to discourage stagnant program outputs; its value is selected to stabilize training.
  • reward shaping scale
    Optional scaling factor for training signals; tuned for stable RL updates.
axioms (1)
  • domain assumption: LLMs can generate syntactically valid programs whose quality can be evaluated by an external objective function
    Core premise enabling the program-evolution loop.
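The single axiom is concrete for the circle-packing task: feasibility and quality of a proposed configuration are checkable by a short external scorer. A hedged sketch (function name and tolerance are ours, not the paper's evaluator):

```python
import math

def packing_score(circles, tol=1e-9):
    """External objective for circle packing in the unit square.

    `circles` is a list of (x, y, r) triples. Returns the sum of radii if
    the configuration is feasible (no boundary violation, no pairwise
    overlap), else -inf, so infeasible programs score strictly worse than
    any feasible one.
    """
    for x, y, r in circles:
        if r < 0 or x - r < -tol or x + r > 1 + tol or y - r < -tol or y + r > 1 + tol:
            return float("-inf")
    for i in range(len(circles)):
        for j in range(i + 1, len(circles)):
            xi, yi, ri = circles[i]
            xj, yj, rj = circles[j]
            if math.hypot(xi - xj, yi - yj) < ri + rj - tol:
                return float("-inf")
    return sum(r for _, _, r in circles)
```

An evolved program only ever sees this scalar, which is why the program-evolution loop needs nothing stronger than the stated axiom.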

pith-pipeline@v0.9.0 · 5613 in / 1321 out tokens · 42091 ms · 2026-05-16T13:11:44.260084+00:00 · methodology


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

    cs.CL 2026-05 conditional novelty 8.0

    AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning task...

  2. Test-Time Learning with an Evolving Library

    cs.LG 2026-05 unverdicted novelty 7.0

    EvoLib enables LLMs to accumulate, reuse, and evolve knowledge abstractions from inference trajectories at test time, yielding substantial gains on math reasoning, code generation, and agentic benchmarks without param...

  3. Evolutionary Ensemble of Agents

    cs.NE 2026-05 unverdicted novelty 7.0

    EvE uses co-evolving populations of solvers and guidance states with Elo-based evaluation to autonomously discover a rescale-then-interpolate mechanism for better generalization in In-Context Operator Networks.

  4. LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

    cs.CL 2026-05 unverdicted novelty 7.0

    AutoTTS discovers superior test-time scaling strategies for LLMs via cheap controller synthesis in a pre-collected trajectory environment, outperforming manual baselines on math benchmarks with low discovery cost.

  5. Agentic-imodels: Evolving agentic interpretability tools via autoresearch

    cs.AI 2026-05 unverdicted novelty 7.0

    Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.

  6. $k$-server-bench: Automating Potential Discovery for the $k$-Server Conjecture

    cs.MS 2026-04 accept novelty 7.0

    k-server-bench formulates potential-function discovery for the k-server conjecture as a code-based inequality-satisfaction task; current agents fully solve the resolved k=3 case and reduce violations on the open k=4 case.

  7. Learning to Discover at Test Time

    cs.LG 2026-01 unverdicted novelty 7.0

    TTT-Discover applies test-time RL to set new state-of-the-art results on math inequalities, GPU kernels, algorithm contests, and single-cell denoising using an open model and public code.

  8. MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI

    cs.LG 2026-05 unverdicted novelty 6.0

    MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods, with engineering tuning easier than genuine invention.

  9. Agentic Architect: An Agentic AI Framework for Architecture Design Exploration and Optimization

    cs.AI 2026-04 accept novelty 6.0

    An LLM-driven agentic system evolves microarchitectural policies for cache replacement, data prefetching, and branch prediction, producing designs that match or exceed prior state-of-the-art in IPC on standard benchmarks.

  10. Evaluation-driven Scaling for Scientific Discovery

    cs.LG 2026-04 unverdicted novelty 6.0

    SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster ...

  11. TEMPO: Scaling Test-time Training for Large Reasoning Models

    cs.LG 2026-04 unverdicted novelty 6.0

    TEMPO scales test-time training for large reasoning models by interleaving policy refinement on unlabeled data with critic recalibration on labeled data via an EM formulation, yielding large gains on AIME tasks.

  12. Co-evolving Agent Architectures and Interpretable Reasoning for Automated Optimization

    cs.AI 2026-04 unverdicted novelty 6.0

    EvoOR-Agent co-evolves agent architectures as AOE-style networks with graph-mediated recombination and knowledge-base-assisted mutation to outperform fixed LLM pipelines on OR benchmarks.

  13. TurboEvolve: Towards Fast and Robust LLM-Driven Program Evolution

    cs.NE 2026-04 unverdicted novelty 6.0

    TurboEvolve improves LLM program evolution by running parallel islands with LLM-generated diverse candidates that carry self-assigned weights, an adaptive scheduler, and clustered seed injection to reach stronger solu...

  14. AI-Driven Research for Databases

    cs.DB 2026-04 unverdicted novelty 6.0

    Co-evolving LLM-generated solutions with their evaluators enables discovery of novel database algorithms that outperform state-of-the-art baselines, including a query rewrite policy with up to 6.8x lower latency.

  15. Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace

    cs.AI 2026-05 unverdicted novelty 5.0 partial

    Shepherd is a runtime system that formalizes meta-agent operations via typed execution traces, enabling fast forking and demonstrated improvements in agent intervention, optimization, and training on benchmarks.

  16. Evolutionary Ensemble of Agents

    cs.NE 2026-05 unverdicted novelty 5.0

    EvE co-evolves code solvers and guidance states via synchronous races and Elo updates, discovering a rescale-then-interpolate mechanism that enables example-count generalization in ICON.

  17. PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents

    cs.LG 2026-05 unverdicted novelty 5.0

    PACEvolve++ uses a phase-adaptive reinforcement learning advisor to decouple hypothesis selection from execution in LLM-driven evolutionary search, delivering faster convergence than prior frameworks on load balancing...

  18. Grokability in five inequalities

    math.PR 2026-05 unverdicted novelty 5.0

    Five improved inequalities were found with AI help: better Gaussian perimeter bounds for convex sets, sharper L2-L1 moments on the Hamming cube, a strengthened autoconvolution inequality, improved g-Sidon set bounds, ...

  19. Training-Free Test-Time Contrastive Learning for Large Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    TF-TTCL lets frozen LLMs adapt online by distilling textual rules from contrastive reasoning trajectories generated via multi-agent augmentation and applying them through retrieval-based steering.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · cited by 17 Pith papers · 2 internal anchors


    Extensibility: The modular design allows integration of gradient information (objective_jac) for future perfor- mance optimization and potential hybrid RL–SLSQP training loops. 28 ThetaEvolve: Test-time Learning on Open Problems E.3. More Visualizations Performance Curve.We include the performance curves of ThetaEvolve with RL and pure inference in Fig. 7...