ERBench: A Benchmark and Testsuite for Equation Discovery Algorithms

Henrik Voigt; Joachim Giesen; Michael Habeck; Paul Kahlmeyer

arxiv: 2606.09276 · v1 · pith:T575GYKLnew · submitted 2026-06-08 · 💻 cs.LG


Pith reviewed 2026-06-27 17:03 UTC · model grok-4.3

classification 💻 cs.LG

keywords equation discovery · symbolic regression · benchmark · equation recovery · robustness · machine learning evaluation · scientific modeling


The pith

ERBench supplies a large suite of groundtruth equations and systematic robustness tests to evaluate symbolic regression algorithms for equation discovery.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ERBench as a benchmark that shifts evaluation of symbolic regression from simple in-domain accuracy to equation recovery across changing dimensionality, sample sizes, distributions, and domains. It argues that recovery of known formulas serves as a practical proxy for an algorithm's ability to discover new scientific models from noisy real-world data. Existing benchmarks use too few public formulas and do not stress-test robustness, leaving practitioners without reliable ways to select methods for modeling natural phenomena. By providing this framework, the work aims to make comparisons between algorithms more rigorous and more relevant to actual discovery tasks.

Core claim

Equation discovery is best evaluated by how reliably symbolic regression recovers known groundtruth formulas rather than by in-domain prediction accuracy alone; ERBench implements this by supplying many groundtruth equations together with controlled variations in dimensionality, sampling size, sampling distribution, sampling domain, and noise levels.

What carries the argument

ERBench, an evaluation framework that measures equation recovery performance under controlled changes in data conditions.
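As a rough illustration of what "controlled changes in data conditions" can mean in practice, the sketch below draws regression problems from one ground-truth formula while varying sample size, sampling domain, and noise level. Everything in it is an assumption made for illustration: the formula, the helper `make_task`, and the grid values are hypothetical and are not taken from ERBench.

```python
import itertools
import math
import random

def ground_truth(x0, x1):
    # Hypothetical ground-truth formula, chosen only for illustration.
    return x0 * math.exp(-x1)

def make_task(n_samples, domain, noise_std, seed):
    """Draw one regression problem from the ground truth under fixed conditions."""
    rng = random.Random(seed)
    low, high = domain
    X = [(rng.uniform(low, high), rng.uniform(low, high)) for _ in range(n_samples)]
    y = [ground_truth(x0, x1) + rng.gauss(0.0, noise_std) for x0, x1 in X]
    return X, y

# Illustrative grid of conditions; the values are assumptions, not ERBench settings.
sample_sizes = [10, 100]
domains = [(0.0, 1.0), (-5.0, 5.0)]
noise_levels = [0.0, 0.1]

# One task per combination of (sample size, domain, noise level).
tasks = {
    cond: make_task(*cond, seed=i)
    for i, cond in enumerate(itertools.product(sample_sizes, domains, noise_levels))
}
print(len(tasks))  # 2 * 2 * 2 = 8 condition combinations
```

An algorithm under test would be run once per entry in `tasks`, so that its recovery rate can be reported per condition rather than as a single aggregate number.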

If this is right

Practitioners can rank symbolic regression methods by their robustness to realistic variations in data rather than by interpolation performance.
Development of new algorithms can target explicit recovery metrics instead of test-set accuracy alone.
Benchmarks can expand by adding more groundtruth equations without changing the evaluation protocol.
Comparisons across papers become standardized when all methods are tested on the same recovery tasks with the same condition variations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benchmark could be extended to include partial or incomplete observations to test discovery under missing data regimes common in experiments.
If recovery correlates with discovery success, the same test suite could serve as a filter before deploying methods on high-stakes scientific modeling problems.
Standardized recovery scores might accelerate consensus on which algorithmic components (operators, search strategies, regularization) actually drive discovery performance.

Load-bearing premise

Algorithms that recover known groundtruth formulas well will also succeed at discovering unknown equations from real data.

What would settle it

An experiment in which the algorithms that score highest on ERBench recovery tasks fail to recover or discover accurate equations on a fresh collection of scientific datasets not included in the benchmark.

Figures

Figures reproduced from arXiv: 2606.09276 by Henrik Voigt, Joachim Giesen, Michael Habeck, Paul Kahlmeyer.

**Figure 1.** Generalization ability of a symbolic equation compared to a polynomial regressor. The figure contrasts the out-of-domain generalization capabilities of a proposed symbolic equation against a polynomial model. The ground truth is Newton’s Law of Cooling, $T(t) = T_{\mathrm{env}} + (T_0 - T_{\mathrm{env}})\, e^{-kt}$, which describes the temperature of an object over time starting at temperature $T_0$ and cooling down to the temperature of the s…

**Figure 2.** A taxonomy of symbolic regression workflows, distinguishing between pre-training-based and traditional approaches. The process can be divided into three distinct stages. (1) Pre-Training: A phase unique to deep learning-based algorithms, where a general-purpose model learns underlying patterns from a vast corpus of equations (e.g., $x^2$, $\sin(x)$) and correspondingly sampled synthetic data. (2) Training: The …

**Figure 3.** Performance artifacts of state-of-the-art search-based algorithms. Given samples from a ground-truth equation, we measure the percentage of correctly recovered equations. Left: The recovery performance of symbolic regression algorithms degrades when the representation complexity of the ground truth formulas increases. Shown here is the fraction of successfully recovered formulas from the Feynman equations …

**Figure 4.** The success of pre-training-based symbolic regression algorithms depends on the pre-training distribution. Left: We transform 100 equally spaced points $x_0$ in $[0, 1]$ into pre-training data points by using the transformation $10 \cdot x_0 - 5$ for SymFormer, $(x_0 - \bar{x}_0)/\sigma_{x_0}$ for E2E, and $100^{x_0}/10$ for TF4SR. Right: Percentage of correctly recovered ground truth for the three pre-training based algorithms on the func…

**Figure 5.** Communication protocol for the secret evaluation set. Left: On user request, the website generates a unique request ID that serves as a seed for permuting the evaluation set. The user then has access to the permuted regression problems (without ground truth) and the submission ID. Right: Once the user submits expressions and the request ID, the website re-generates the permuted evaluation set and calculate…

**Figure 6.** Diagnostic analysis of PySR’s performance. The plots reveal the specific conditions under which the algorithm succeeds or fails. (a) Recovery rate against problem complexity (number of operators). (b) Performance trend with respect to data distribution. (c) Performance trend with respect to noise level. (d) Performance trend with respect to sample size. While PySR achieves near-perfect recovery on low-comp…

**Figure 7.** Frequency of complexities and input dimensions in the SynEq dataset.

**Figure 8.** Iterative interval expansion for the equation …

**Figure 9.** Samples from the diverse sampling strategy for $y = x_0^2$.
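The protocol in the Figure 5 caption (a request ID that seeds a permutation of the secret evaluation set, which the server can re-generate at scoring time) can be sketched as follows. This is a minimal sketch under stated assumptions: the tiny `SECRET_SET`, the function names, and the exact string comparison used for scoring are all hypothetical stand-ins, since the caption does not describe how the website stores problems or decides that an expression is recovered.

```python
import random
import secrets

# Hypothetical secret evaluation set: sampled data plus the hidden ground truth.
SECRET_SET = [
    {"data": [(0.0, 0.0), (1.0, 1.0), (2.0, 4.0)], "truth": "x0**2"},
    {"data": [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0)], "truth": "2*x0"},
    {"data": [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)], "truth": "x0+1"},
]

def permutation_for(request_id):
    """Derive a reproducible ordering of the evaluation set from the request ID."""
    order = list(range(len(SECRET_SET)))
    random.Random(request_id).shuffle(order)  # the request ID acts as the seed
    return order

def new_request():
    """Server side: issue a request ID and the permuted problems without ground truth."""
    request_id = secrets.token_hex(8)
    problems = [SECRET_SET[i]["data"] for i in permutation_for(request_id)]
    return request_id, problems

def score(request_id, submitted):
    """Server side: rebuild the same permutation and compare against the ground truth."""
    truths = [SECRET_SET[i]["truth"] for i in permutation_for(request_id)]
    # Assumption: exact string match stands in for a real equivalence check.
    hits = sum(s == t for s, t in zip(submitted, truths))
    return hits / len(truths)

request_id, problems = new_request()
print(request_id, len(problems))
```

Because the ordering is a pure function of the request ID, the server in this sketch does not need to store per-user state between the request and the submission.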

Original abstract

Equation discovery aims to automate the discovery of scientific models in the form of mathematical equations from data. Technically, equation discovery is implemented by symbolic regression algorithms. Performance of symbolic regression for equation discovery is measured along two dimensions: Prediction accuracy on test data, and recovery of known groundtruth formulas. For standard regression, accuracy is typically measured on in-domain test data, for instance, by splitting a data set randomly into training and test data. While this makes sense for in-domain interpolation, which is the common goal in ordinary regression, it can be a misleading proxy for true model discovery and generalization. The obvious alternative is to measure out-of-domain accuracy. However, obtaining challenging out-of-domain test data is a non-trivial problem. Therefore, we focus on equation recovery for evaluating symbolic regression algorithms for equation discovery. The rationale is that symbolic regression algorithms that perform well in recovering known groundtruth formulas are good candidates to perform well in unknown equation discovery. Existing benchmarks for symbolic regression include equation recovery tasks, however, with only a small number of groundtruth formulas that are publicly known. Moreover, these benchmarks place less emphasis on evaluating the robustness of algorithms in terms of their behavior under changing dimensionality, sampling size, sampling distribution and sampling domain. This, however, is of central importance to practitioners wanting to discover equations for modeling natural phenomena, since data is almost certainly noisy and comes from diverse domains, distributions, and sample sizes. To fill this gap, we introduce the Equation Recovery Benchmark (ERBench), a new evaluation framework designed to rigorously assess algorithms explicitly targeting the task of equation discovery.
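The abstract's point that in-domain accuracy can be a misleading proxy may be easier to see with a small numerical sketch. It uses Newton's law of cooling (the ground truth named in the Figure 1 caption) and, as a stand-in for a polynomial regressor, the degree-5 polynomial that interpolates six training points. The constants, the training grid, and the choice of an interpolating polynomial are assumptions made for illustration and are not the paper's experimental setup.

```python
import math

T_ENV, T_0, K = 20.0, 90.0, 0.5  # illustrative constants, not taken from the paper

def cooling(t):
    """Newton's law of cooling: T(t) = T_env + (T_0 - T_env) * exp(-k t)."""
    return T_ENV + (T_0 - T_ENV) * math.exp(-K * t)

# Training data observed only on the interval [0, 5].
train_t = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [cooling(t) for t in train_t]

def poly(t):
    """Degree-5 polynomial through the training points (Lagrange form)."""
    total = 0.0
    for i, (ti, yi) in enumerate(zip(train_t, train_y)):
        weight = 1.0
        for j, tj in enumerate(train_t):
            if j != i:
                weight *= (t - tj) / (ti - tj)
        total += yi * weight
    return total

in_domain = [0.5, 1.5, 2.5, 3.5, 4.5]   # between the training points
out_of_domain = [10.0, 15.0, 20.0]      # beyond the training interval

in_err = max(abs(poly(t) - cooling(t)) for t in in_domain)
out_err = max(abs(poly(t) - cooling(t)) for t in out_of_domain)
print(f"max in-domain error:     {in_err:.4f}")
print(f"max out-of-domain error: {out_err:.4f}")
```

In this setup the polynomial tracks the true curve closely between the training points and departs from it once `t` leaves the training interval, whereas the true symbolic form has zero error everywhere by construction.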

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ERBench adds robustness tests to symbolic regression evaluation but rests on an unbacked claim that recovering known formulas predicts success at discovering new ones.


ERBench proposes a benchmark for equation discovery that stresses robustness across dimensionality, sampling size, distribution, and domain. The main point is that it chooses to evaluate via recovery of known ground-truth equations instead of out-of-domain prediction.

The paper correctly notes that random in-domain splits are a weak proxy for discovery and that true out-of-domain test data is hard to obtain. It also points out that existing benchmarks use only a small number of public formulas and pay less attention to how algorithms behave under changes in data characteristics that matter for real modeling tasks.

The central weakness is the rationale itself. The paper states that algorithms good at recovering known formulas are good candidates for unknown discovery, yet supplies no argument, citation, or small-scale check to support why this proxy should hold. If methods can succeed by fitting patterns in the finite set of public expressions or by exploiting benchmark regularities, the transfer to novel scientific models is not guaranteed. This assumption carries the evaluation framework.

No derivations or fitted quantities appear, and the abstract gives no implementation details or baseline results. If the full paper ships the actual test suite with code and some validation runs, that would make the contribution more concrete.

Researchers working on symbolic regression algorithms would find this worth examining to see how their methods perform under the added robustness conditions, provided the benchmark is released in usable form. It deserves peer review because benchmarks shape evaluation standards in the subfield even when their premises need scrutiny.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes ERBench, a benchmark and testsuite for evaluating symbolic regression algorithms on equation discovery. It centers evaluation on recovery of known groundtruth formulas rather than out-of-domain prediction accuracy, on the grounds that recovery success is a suitable proxy for discovery capability on unknown models. The design emphasizes robustness testing across dimensionality, sample size, sampling distribution, and domain, while critiquing prior benchmarks for limited public formulas and insufficient robustness coverage.

Significance. If the proxy assumption can be justified and the benchmark implemented with reproducible test suites, ERBench could provide a standardized, practitioner-oriented evaluation framework that prioritizes robustness properties relevant to scientific modeling. The focus on multiple data regimes is a constructive direction for the field.

major comments (1)

Abstract (and introduction): The central design choice rests on the claim that 'symbolic regression algorithms that perform well in recovering known groundtruth formulas are good candidates to perform well in unknown equation discovery.' No argument, citation, or preliminary result is supplied to support transfer from recovery of public formulas to novel scientific models; if recovery can be achieved by overfitting to the finite set of known expressions, the benchmark would not rigorously assess discovery capability.

minor comments (1)

The abstract presents the benchmark rationale and motivation but contains no implementation details, validation experiments, or quantitative results. A dedicated section describing the concrete test-suite construction, ground-truth selection criteria, and planned evaluation protocol would strengthen the manuscript.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the central design choice of ERBench. We address the major comment below.


Referee: Abstract (and introduction): The central design choice rests on the claim that 'symbolic regression algorithms that perform well in recovering known groundtruth formulas are good candidates to perform well in unknown equation discovery.' No argument, citation, or preliminary result is supplied to support transfer from recovery of public formulas to novel scientific models; if recovery can be achieved by overfitting to the finite set of known expressions, the benchmark would not rigorously assess discovery capability.

Authors: We agree that the manuscript states the rationale for using recovery of known groundtruth formulas as a proxy without supplying supporting citations, arguments, or preliminary results. The provided text motivates the choice primarily by noting the difficulty of obtaining challenging out-of-domain test data. In the revised manuscript we will expand the abstract and introduction to include additional discussion of this assumption, reference prior symbolic regression benchmarks that similarly rely on recovery metrics, and explicitly address the limitation of potential overfitting to the finite set of public expressions. We will also note how the benchmark's emphasis on robustness across dimensionality, sample size, distribution, and domain is intended to mitigate some of these risks.

Revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark proposal with no derived results or self-referential steps


The paper introduces ERBench as an evaluation framework for symbolic regression algorithms. Its justification rests on an explicit rationale statement rather than any derivation, equation, fitted parameter, or self-citation chain. No load-bearing step reduces to its own inputs by construction; the work contains no predictions, uniqueness theorems, or ansatzes that could exhibit circularity. This is the expected outcome for a benchmark proposal that does not claim to derive new scientific results from its own data or prior outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on one domain assumption about the validity of equation recovery as a proxy metric.

axioms (1)

domain assumption: symbolic regression algorithms that perform well in recovering known groundtruth formulas are good candidates to perform well in unknown equation discovery
Explicitly stated in the abstract as the rationale for focusing evaluation on equation recovery rather than out-of-domain prediction.

pith-pipeline@v0.9.1-grok · 5820 in / 1203 out tokens · 27299 ms · 2026-06-27T17:03:02.949164+00:00 · methodology


Reference graph

Works this paper leans on

14 extracted references · 4 canonical work pages

[1]

doi: 10.24963/ijcai.2024/471

International Joint Conferences on Artificial Intelligence Organization, 8 2024. doi: 10.24963/ijcai.2024/471. URL https://doi.org/10.24963/ijcai.2024/471. Main Track. Paul Kahlmeyer, Markus Fischer, and Joachim Giesen. Dimension reduction for symbolic regression. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 17707–17714...

work page doi:10.24963/ijcai.2024/471 2024
[2]

Florian Lalande, Yoshitomo Matsubara, Naoya Chiba, Tatsunori Taniai, Ryo Igarashi, and Yoshitaka Ushiku

URL https://cavalab.org/srbench/competition-2022/. Florian Lalande, Yoshitomo Matsubara, Naoya Chiba, Tatsunori Taniai, Ryo Igarashi, and Yoshitaka Ushiku. A transformer model for symbolic regression towards scientific discovery. In NeurIPS 2023 AI for Science Workshop, 2023. URL https://openreview.net/forum?id=AIfqWNHKjo. Guillaume Lample and François Char...

work page doi:10.1016/j.inffus.2024.102681 2022
[3]

Randal Olson, William La Cava, Patryk Orzechowski, Ryan Urbanowicz, and Jason Moore

[Online; accessed 24-April-2025]. Randal Olson, William La Cava, Patryk Orzechowski, Ryan Urbanowicz, and Jason Moore. Pmlb: A large benchmark suite for machine learning evaluation and comparison. BioData Mining, 10, 12 2017. doi: 10.1186/s13040-017-0154-4. Patryk Orzechowski, William La Cava, and Jason H. Moore. Where are we now? a large benchmark study o...

work page doi:10.1186/s13040-017-0154-4 2025
[4]

doi: 10.1038/s41592-019-0686-2

doi: 10.1038/s41592-019-0686-2. [Online; accessed 24-April-2025]. Ekaterina (Katya) Vladislavleva, Guido Smits, and Dick den Hertog. Order of nonlinearity as a complexity measure for models generated by symbolic regression via pareto genetic programming. Evolutionary Computation, IEEE Transactions on, 13:333–349, 05 2009. doi: 10.1109/TEVC.2008.926486. W...

work page doi:10.1038/s41592-019-0686-2 2025
[5]

Sample from a large box and find one valid point. Set the initial box to the zero-width box located at this point.
[6]

From the current box, gradually expand the space around each dimension. For each expansion, sample from the additional space and check the validity of the samples.
[7]

If an expansion fails, the step size is reduced until the expansion is valid or a minimum step size is reached.
[8]

If any dimension has been expanded and the maximum number of expansions is not exceeded, go to step 2 of this procedure (entry [6]). Based upon this algorithm, we were able to generate valid intervals for all the SynEq formulas.

Sampling Points. In Section 2 of the paper, we discussed the dependency of pre-training based algorithms on their pre-training distribution. A consequence of this de...
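The interval expansion procedure extracted in entries [5] to [8] can be sketched as below. This is one possible reading under explicit assumptions: a point counts as "valid" when the formula evaluates to a finite real number there, a slab of newly added space is accepted when a fixed number of random probes are all valid, and the parameter names and default values (`big`, `step`, `min_step`, `max_rounds`, `n_probe`) are hypothetical. The source does not specify any of these details.

```python
import math
import random

def is_valid(f, point):
    """Assumed validity test: f evaluates to a finite real (float) value."""
    try:
        value = f(*point)
    except (ValueError, ZeroDivisionError, OverflowError):
        return False
    return isinstance(value, float) and math.isfinite(value)

def slab_is_valid(f, box, dim, lo, hi, rng, n_probe):
    """Probe the added slab: `dim` ranges over [lo, hi], other dims over the box."""
    for _ in range(n_probe):
        point = [rng.uniform(a, b) for a, b in box]
        point[dim] = rng.uniform(lo, hi)
        if not is_valid(f, point):
            return False
    return True

def expand_box(f, n_dims, big=100.0, step=1.0, min_step=1e-3,
               max_rounds=20, n_probe=50, seed=0):
    rng = random.Random(seed)
    # Step 1: find one valid point in a large box; start from a zero-width box there.
    for _ in range(10_000):
        start = [rng.uniform(-big, big) for _ in range(n_dims)]
        if is_valid(f, start):
            break
    else:
        raise ValueError("no valid starting point found")
    box = [[x, x] for x in start]
    # Step 4: repeat while some side was expanded and the round budget remains.
    for _ in range(max_rounds):
        expanded = False
        # Step 2: try to grow each dimension on its lower and upper side.
        for dim in range(n_dims):
            for side, sign in ((0, -1.0), (1, 1.0)):
                s = step
                # Step 3: halve the step until the expansion is valid or too small.
                while s >= min_step:
                    edge = box[dim][side]
                    new_edge = edge + sign * s
                    lo, hi = min(edge, new_edge), max(edge, new_edge)
                    if slab_is_valid(f, box, dim, lo, hi, rng, n_probe):
                        box[dim][side] = new_edge
                        expanded = True
                        break
                    s /= 2.0
        if not expanded:
            break
    return [tuple(side) for side in box]

print(expand_box(lambda x: math.log(x), n_dims=1))
```

Because validity is only probed at random points, the returned box is a heuristic estimate; a slab that is invalid on a very small region can occasionally be accepted.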
[9]

Sample the number of mixture parts $k \sim \mathrm{Cat}(1, K)$.
[10]

Sample the mixture weights $[w_i \sim U(0,1)]_{i=1,\dots,k}$.
[11]

Normalize the weight vector $w := w/\|w\|$.
[12]

For each mixture component:
(a) Randomly choose a distribution from $[U, N]$.
(b) If uniform, randomly select a sub-interval $[b, c] \subseteq [a, b]$.
(c) If normal, randomly select a mean $\mu \sim U(a, b)$ and set $\sigma = \min(b - \mu, \mu - a)/3$.
(d) Generate $\lfloor w_i \cdot N \rfloor$ samples.
(e) Clip the samples to $[a, b]$.
[13]

Generate the remaining sample points from the first mixture component. In our implementation, we set the maximum number of mixture components to $K = 3$. An example of samples generated for the function $y = x_0^2$ is shown in Figure 9.
[14]

Figure 9: Samples from the diverse sampling strategy for $y = x_0^2$. Points are sampled from a mixture model, where the mixture components are either uniform or Gaussian.
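The diverse sampling steps extracted in entries [9] to [13] can be sketched in a few lines. Several details are assumptions on my part rather than statements from the source: $\mathrm{Cat}(1, K)$ is read as a uniform choice from $\{1, \dots, K\}$, the weights are normalized by their sum so that the per-component counts add up to at most $N$, the uniform and normal families are chosen with equal probability, and the sub-interval is built from two sorted uniform draws. The function name `diverse_sample` is hypothetical.

```python
import math
import random
from functools import partial

def diverse_sample(a, b, n, K=3, seed=0):
    """Draw n points in [a, b] from a random mixture of uniform and normal parts."""
    rng = random.Random(seed)
    clip = lambda x: min(max(x, a), b)

    k = rng.randint(1, K)                    # number of mixture parts
    w = [rng.random() for _ in range(k)]     # raw weights from U(0, 1)
    total = sum(w)
    w = [wi / total for wi in w]             # assumption: normalize so weights sum to 1

    draws, samples = [], []
    for wi in w:
        if rng.random() < 0.5:
            # Uniform part on a random sub-interval of [a, b].
            lo, hi = sorted((rng.uniform(a, b), rng.uniform(a, b)))
            draw = partial(rng.uniform, lo, hi)
        else:
            # Normal part; three standard deviations reach the nearer boundary.
            mu = rng.uniform(a, b)
            sigma = min(b - mu, mu - a) / 3.0
            draw = partial(rng.gauss, mu, sigma)
        draws.append(draw)
        samples += [clip(draw()) for _ in range(math.floor(wi * n))]

    # Flooring can leave a few points missing; take them from the first part.
    while len(samples) < n:
        samples.append(clip(draws[0]()))
    return samples[:n]

xs = diverse_sample(-10.0, 5.0, 200)
ys = [x ** 2 for x in xs]  # e.g. targets for y = x0^2
print(len(xs), min(xs), max(xs))
```

Changing `seed` changes both the mixture structure and the draws, which gives a different input distribution per problem.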
