arxiv: 2511.09416 · v2 · submitted 2025-11-12 · 💻 cs.LG · cs.NE

Transformer Semantic Genetic Programming for d-dimensional Symbolic Regression Problems

Pith reviewed 2026-05-17 23:09 UTC · model grok-4.3

keywords: symbolic regression, genetic programming, transformer model, semantic search, evolutionary algorithms, program synthesis, machine learning

The pith

A single pre-trained transformer acts as a semantic variation operator in genetic programming to produce offspring with similar meaning but varied structure for symbolic regression.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that training one transformer on millions of programs lets it serve as a general-purpose operator that suggests structural changes while keeping semantics close to a parent program. This approach replaces fixed syntactic rules used in earlier semantic genetic programming methods. On 24 real-world and synthetic datasets the method ranks first on average, beats standard GP and several recent competitors, and yields more compact expressions. The target semantic distance parameter lets users control whether the search makes small steady steps or larger jumps toward better fitness. A sympathetic reader would care because symbolic regression seeks explicit, interpretable equations rather than black-box predictors, and a generalizable operator could reduce the need for problem-specific tuning.

Core claim

Transformer Semantic Genetic Programming treats a pre-trained transformer as a variation operator that generates new programs whose output values stay close to those of a parent on the training points. One model trained across many programs generalizes to symbolic regression tasks with different input dimensions. Across 24 benchmarks the method achieves an average rank of 1.58, produces smaller solutions than SLIM_GSGP while reaching higher accuracy, and uses the chosen semantic distance to trade consistent small improvements against faster convergence and compactness.
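
To make "output values stay close to those of a parent on the training points" concrete, here is a minimal sketch. It assumes that the semantics of a program is its vector of outputs on the training inputs and that closeness is measured with Euclidean distance, the measure named in the Figure 4 caption. The helper names, the toy programs, and the rule of picking the candidate nearest to the target distance are illustrative assumptions, not the paper's implementation; in TSGP the candidates would be proposed by the transformer.

```python
import math

def semantics(program, inputs):
    """Semantics of a program: its outputs on the training inputs."""
    return [program(*x) for x in inputs]

def semantic_distance(sem_a, sem_b):
    """Euclidean distance between two semantics vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(sem_a, sem_b)))

def closest_to_target(parent, candidates, inputs, target_sd):
    """Pick the candidate whose distance to the parent is nearest target_sd.

    Hypothetical selection rule, used here only to illustrate the knob.
    """
    parent_sem = semantics(parent, inputs)
    return min(
        candidates,
        key=lambda c: abs(
            semantic_distance(parent_sem, semantics(c, inputs)) - target_sd
        ),
    )

# Toy 1-dimensional training inputs (hypothetical).
inputs = [(0.0,), (1.0,), (2.0,)]

parent = lambda x: x * (x + 1)        # outputs 0, 2, 6
same_meaning = lambda x: x * x + x    # different structure, same outputs
shifted = lambda x: x * x + x + 1.0   # every output shifted by 1

parent_sem = semantics(parent, inputs)
d_same = semantic_distance(parent_sem, semantics(same_meaning, inputs))
d_shift = semantic_distance(parent_sem, semantics(shifted, inputs))
chosen = closest_to_target(parent, [same_meaning, shifted], inputs, target_sd=1.5)
```

Here `same_meaning` differs from the parent in structure but sits at distance 0, while `shifted` sits at distance sqrt(3). With a target of 1.5 the sketch selects `shifted`, which illustrates how a larger target distance would favour bigger steps in semantic space.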

What carries the argument

Pre-trained transformer model used as a variation operator that proposes diverse structural changes while preserving high semantic similarity to the parent program.

If this is right

  • TSGP reaches higher accuracy than standard GP, SLIM_GSGP, Deep Symbolic Regression, and Denoising Autoencoder GP on the same 24 datasets.
  • Solutions found by TSGP are more compact than those found by SLIM_GSGP at equal or better accuracy.
  • Choosing a small target semantic distance produces steady fitness gains but tends to grow program size.
  • Choosing a larger target semantic distance speeds convergence and keeps programs smaller.
  • The single transformer removes the need for hand-crafted syntactic transformation rules in semantic GP.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same transformer could be tested as a variation operator inside other evolutionary program-synthesis frameworks beyond regression.
  • Retraining or fine-tuning the transformer on programs from a narrow domain might further improve performance on that domain at the cost of losing cross-dimension generality.
  • Because the operator learns from program semantics rather than syntax, it might transfer to regression tasks whose function sets differ from the training distribution.
  • The semantic-distance knob offers a direct handle on exploration-exploitation balance that could be combined with other population-management techniques.

Load-bearing premise

A single transformer trained on millions of programs can generalize to symbolic regression problems that differ in the number of input dimensions.

What would settle it

Running TSGP on a set of symbolic regression problems whose input dimensions lie well outside the range seen during transformer training and observing whether accuracy or compactness collapses relative to baselines.

Figures

Figures reproduced from arXiv: 2511.09416 by Dominik Sobania, Franz Rothlauf, Philipp Anthes.

Figure 1: Model Building of TSGP: (1) Diverse functions are generated and their semantics are approximated; (2) Semantically similar pairs are identified through a 𝑘-NN search in the semantic space; (3) These pairs are used as input-output examples to train a transformer model, conditioned on their semantic distance SD and the problem dimensionality 𝑑.
Figure 2: Median training RMSE of the best programs (solutions) of TSGP, stdGP, SLIM_GSGP over generations
Figure 3: Median size of the solutions of TSGP, stdGP, SLIM_GSGP over generations on a subset of the analyzed …
Figure 4: Median Euclidean distance between the semantics …
Figure 5: Median number of generations without improving the training RMSE over the number of generations.
Figure 6: Median training RMSE of the solutions of TSGP, stdGP, SLIM_GSGP over the number of generations
Figure 7: Median solution size of TSGP, stdGP, SLIM_GSGP over the number of generations on the remaining …
Figure 8: Median Euclidean distance between the semantics …
Figure 9: Median number of generations without improving the training RMSE over the number of generations.
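
Step (2) of the Figure 1 caption, identifying semantically similar pairs through a 𝑘-NN search, can be sketched as follows. This is a minimal illustration under stated assumptions: semantics are plain output vectors, the search is brute force with Euclidean distance, and the function name `knn_pairs` and the four-expression library are hypothetical. The paper's actual search procedure and pair format are not described on this page.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two semantics vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def knn_pairs(semantics_by_program, k):
    """Pair each program with its k nearest neighbours in semantic space.

    Returns (source, target, distance) triples, a stand-in for the
    input-output training examples mentioned in the caption. Brute force
    is an assumption made for brevity.
    """
    names = list(semantics_by_program)
    pairs = []
    for src in names:
        neighbours = sorted(
            (euclidean(semantics_by_program[src], semantics_by_program[other]), other)
            for other in names
            if other != src
        )
        for dist, tgt in neighbours[:k]:
            pairs.append((src, tgt, dist))
    return pairs

# Hypothetical semantics: outputs of four expressions on x = 0, 1, 2.
library = {
    "x*x": [0.0, 1.0, 4.0],
    "x*x+0.1": [0.1, 1.1, 4.1],
    "2*x": [0.0, 2.0, 4.0],
    "x+5": [5.0, 6.0, 7.0],
}
pairs = knn_pairs(library, k=1)
nearest = {src: tgt for src, tgt, _ in pairs}
```

Each triple keeps the distance alongside the pair, since the caption states that training is conditioned on the semantic distance SD (and on the dimensionality 𝑑, which is 1 for every expression in this toy library).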
Original abstract

Transformer Semantic Genetic Programming (TSGP) is a semantic search approach that uses a pre-trained transformer model as a variation operator to generate offspring programs with high semantic similarity to a given parent. Unlike other semantic GP approaches that rely on fixed syntactic transformations, TSGP aims to learn diverse structural variations that lead to solutions with similar semantics. We find that a single transformer model trained on millions of programs is able to generalize across symbolic regression problems of varying dimension. Evaluated on 24 real-world and synthetic datasets, TSGP significantly outperforms standard GP, SLIM_GSGP, Deep Symbolic Regression, and Denoising Autoencoder GP, achieving an average rank of 1.58 across all benchmarks. Moreover, TSGP produces more compact solutions than SLIM_GSGP, despite its higher accuracy. In addition, the target semantic distance is able to effectively adjust the step size in the semantic space: small values enable consistent improvement in fitness but often lead to larger programs, while larger values promote faster convergence and compactness. Thus, the target semantic distance provides an effective mechanism for balancing exploration and exploitation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. This paper introduces Transformer Semantic Genetic Programming (TSGP), a semantic GP approach that uses a pre-trained transformer model as a variation operator to generate offspring programs with high semantic similarity to a parent. The central claims are that a single transformer trained on millions of programs generalizes across symbolic regression problems of varying dimension d, that TSGP significantly outperforms standard GP, SLIM_GSGP, Deep Symbolic Regression, and Denoising Autoencoder GP on 24 real-world and synthetic datasets (average rank 1.58), produces more compact solutions than SLIM_GSGP, and that the target semantic distance parameter effectively balances exploration/exploitation by controlling step size in semantic space.

Significance. If the empirical claims hold under rigorous verification, this work would advance semantic genetic programming by demonstrating that learned transformer-based variation operators can outperform fixed syntactic or other learned baselines while yielding compact expressions. The reported ability of one model to handle varying d is noteworthy for symbolic regression applications if the training distribution is shown to cover the necessary range; the 24-dataset evaluation provides a broad test bed, and the explicit control via target semantic distance offers a practical tuning mechanism.

major comments (2)
  1. [Methods] Methods section on program generation and transformer pre-training: The generalization claim ('a single transformer model trained on millions of programs is able to generalize across symbolic regression problems of varying dimension') is load-bearing for the headline result. No details are provided on the distribution of d (number of variables) or arity in the training corpus; if generation fixes d or samples from a narrow band, performance on higher-d test problems cannot be attributed to learned cross-dimensional generalization and may instead reflect in-distribution behavior or per-problem adaptation.
  2. [Experiments / Results] Experimental results and Table reporting ranks: The average rank of 1.58 and statement of significant outperformance lack per-dataset ranks, standard deviations, error bars, or statistical tests (e.g., Friedman test with post-hoc Nemenyi or Wilcoxon signed-rank). Without these, the ranking claim across the 24 datasets cannot be fully assessed for robustness against benchmark selection or post-hoc choices.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'significantly outperforms' should cross-reference the specific statistical procedure and threshold used in the main text for transparency.
  2. [Notation / Introduction] Notation: Clarify whether 'd' consistently denotes the number of input variables across training generation, problem statements, and results; add a brief definition on first use.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

  1. Referee: [Methods] Methods section on program generation and transformer pre-training: The generalization claim ('a single transformer model trained on millions of programs is able to generalize across symbolic regression problems of varying dimension') is load-bearing for the headline result. No details are provided on the distribution of d (number of variables) or arity in the training corpus; if generation fixes d or samples from a narrow band, performance on higher-d test problems cannot be attributed to learned cross-dimensional generalization and may instead reflect in-distribution behavior or per-problem adaptation.

    Authors: We agree that explicit details on the training distribution are necessary to support the generalization claim. The original manuscript omitted a precise description of how d and arity were sampled when generating the pre-training corpus. In the revised version we will expand the Methods section to document the program generation procedure, including the range and sampling strategy for the number of variables and operator arities. This addition will allow readers to evaluate whether the observed performance on test problems of varying dimension reflects genuine cross-dimensional generalization. revision: yes

  2. Referee: [Experiments / Results] Experimental results and Table reporting ranks: The average rank of 1.58 and statement of significant outperformance lack per-dataset ranks, standard deviations, error bars, or statistical tests (e.g., Friedman test with post-hoc Nemenyi or Wilcoxon signed-rank). Without these, the ranking claim across the 24 datasets cannot be fully assessed for robustness against benchmark selection or post-hoc choices.

    Authors: We concur that additional statistical detail would improve the transparency and robustness of the empirical results. The current manuscript reports only the average rank without per-dataset values or formal tests. In the revision we will add a table (or appendix) listing the rank of TSGP on each of the 24 datasets, include standard deviations or error bars for the performance metrics, and conduct a Friedman test with Nemenyi post-hoc analysis to substantiate the significance of the reported outperformance. revision: yes
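
The per-dataset ranks and the Friedman test discussed in this exchange can be sketched in a few lines. The example below is a minimal illustration, assuming lower RMSE is better and using the standard Friedman chi-square statistic without tie correction. The error table is made up for the example and does not reproduce any result from the paper.

```python
def ranks_with_ties(values):
    """Rank values ascending (1 = lowest error); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks

def average_ranks(errors):
    """errors[dataset][method] -> mean rank of each method across datasets."""
    per_dataset = [ranks_with_ties(row) for row in errors]
    n, k = len(errors), len(errors[0])
    return [sum(r[m] for r in per_dataset) / n for m in range(k)]

def friedman_statistic(errors):
    """Friedman chi-square statistic (no tie correction)."""
    n, k = len(errors), len(errors[0])
    mean_ranks = average_ranks(errors)
    spread = sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4
    return 12 * n / (k * (k + 1)) * spread

# Made-up RMSE values: 4 datasets (rows) x 3 methods (columns).
errors = [
    [0.10, 0.20, 0.30],
    [0.15, 0.25, 0.20],
    [0.30, 0.10, 0.20],
    [0.05, 0.07, 0.09],
]
avg = average_ranks(errors)
stat = friedman_statistic(errors)
```

For this toy table the average ranks are 1.5, 2.0, and 2.5, and the statistic is 2.0. A p-value would come from comparing the statistic against a chi-square distribution with k - 1 degrees of freedom; a library routine such as `scipy.stats.friedmanchisquare` handles that step, along with tie correction.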

Circularity Check

0 steps flagged

No circularity: empirical performance claims rest on external benchmarks

The paper proposes TSGP as a semantic GP method that uses a pre-trained transformer for variation. Its central claims are empirical: a single model generalizes across dimensions and achieves average rank 1.58 on 24 external datasets, outperforming standard GP, SLIM_GSGP, Deep Symbolic Regression, and Denoising Autoencoder GP. No derivation, equation, or first-principles result is presented that reduces by construction to a fitted parameter, self-defined quantity, or self-citation chain. Performance is measured against independent baselines; the training-distribution coverage concern is an assumption about generalization, not a circular reduction of the reported result to the paper's own inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the empirical effectiveness of a pre-trained transformer for semantic variation and on the assumption that semantic distance can be used as a controllable search parameter; no explicit free parameters beyond the tunable semantic distance are stated.

free parameters (1)
  • target semantic distance
    Controls step size in semantic space and is presented as a mechanism for balancing exploration and exploitation; its specific values are chosen per experiment.
axioms (1)
  • domain assumption: A transformer trained on millions of programs can produce structurally diverse offspring that preserve semantic similarity across varying problem dimensions
    Invoked to justify using one model for all d-dimensional symbolic regression tasks.

pith-pipeline@v0.9.0 · 5490 in / 1322 out tokens · 63198 ms · 2026-05-17T23:09:39.735872+00:00 · methodology
