Transformer Semantic Genetic Programming for d-dimensional Symbolic Regression Problems
Pith reviewed 2026-05-17 23:09 UTC · model grok-4.3
The pith
A single pre-trained transformer acts as a semantic variation operator in genetic programming to produce offspring with similar meaning but varied structure for symbolic regression.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Transformer Semantic Genetic Programming (TSGP) treats a pre-trained transformer as a variation operator that generates new programs whose output values stay close to those of a parent on the training points. One model trained across many programs generalizes to symbolic regression tasks with different numbers of input dimensions. Across 24 benchmarks the method achieves an average rank of 1.58 and produces smaller solutions than SLIM_GSGP while reaching higher accuracy. The target semantic distance sets the trade-off: small values give consistent fitness improvements but often larger programs, while larger values give faster convergence and more compact programs.
What carries the argument
A pre-trained transformer model, used as a variation operator, that proposes diverse structural changes while preserving high semantic similarity to the parent program.
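To make "similar meaning but varied structure" concrete, here is a minimal sketch of program semantics as an output vector on the training points, with a distance between two such vectors. The Euclidean metric, the toy points, and the three example programs are assumptions for illustration; the summary above does not state which distance the paper uses.

```python
import math

def semantics(program, points):
    """Output vector of a program evaluated on the training points."""
    return [program(x) for x in points]

def semantic_distance(a, b, points):
    """Euclidean distance between two output vectors (assumed metric)."""
    sa, sb = semantics(a, points), semantics(b, points)
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(sa, sb)))

# Toy 2-dimensional training points (illustrative, not from the paper).
points = [(0.0, 1.0), (1.0, 2.0), (2.0, 0.5), (3.0, -1.0)]

# Hypothetical programs, written as plain Python callables.
parent = lambda x: x[0] * x[0] + x[0] * x[1]
rewrite = lambda x: x[0] * (x[0] + x[1])  # different structure, same outputs
neighbor = lambda x: x[0] * x[0] + x[0] * x[1] + 0.1 * x[1]  # nearby outputs

d_same = semantic_distance(parent, rewrite, points)   # structural change only
d_near = semantic_distance(parent, neighbor, points)  # small semantic step
```

In this sketch, `rewrite` differs from `parent` only in structure, so its distance is zero, while `neighbor` sits a small step away in semantic space.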
If this is right
- TSGP reaches higher accuracy than standard GP, SLIM_GSGP, Deep Symbolic Regression, and Denoising Autoencoder GP on the same 24 datasets.
- Solutions found by TSGP are more compact than those found by SLIM_GSGP at equal or better accuracy.
- Choosing a small target semantic distance produces steady fitness gains but tends to grow program size.
- Choosing a larger target semantic distance speeds convergence and keeps programs smaller.
- The single transformer removes the need for hand-crafted syntactic transformation rules in semantic GP.
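The step-size role of the target semantic distance can be illustrated with a small sketch: among several candidate offspring, pick the one whose distance to the parent is closest to the target. This is only an illustration of the knob. The summary does not say how TSGP enforces the target inside the transformer, so the function `closest_to_target`, the Euclidean metric, and the candidate vectors are all hypothetical.

```python
import math

def distance(u, v):
    """Euclidean distance between two semantic (output) vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def closest_to_target(parent_sem, candidate_sems, target_sd):
    """Index of the candidate whose distance to the parent is nearest target_sd."""
    return min(
        range(len(candidate_sems)),
        key=lambda i: abs(distance(parent_sem, candidate_sems[i]) - target_sd),
    )

# Hypothetical semantics on three training points.
parent_sem = [0.0, 0.0, 0.0]
candidate_sems = [
    [0.1, 0.0, 0.0],  # distance 0.1 from the parent
    [1.0, 0.0, 0.0],  # distance 1.0
    [3.0, 4.0, 0.0],  # distance 5.0
]

small_step = closest_to_target(parent_sem, candidate_sems, target_sd=0.2)
large_step = closest_to_target(parent_sem, candidate_sems, target_sd=4.0)
```

A small target selects the nearest candidate (a cautious step), while a large target selects the farthest one (a bold step). This mirrors the trade-off described in the bullets above.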
Where Pith is reading between the lines
- The same transformer could be tested as a variation operator inside other evolutionary program-synthesis frameworks beyond regression.
- Retraining or fine-tuning the transformer on programs from a narrow domain might further improve performance on that domain at the cost of losing cross-dimension generality.
- Because the operator learns from program semantics rather than syntax, it might transfer to regression tasks whose function sets differ from the training distribution.
- The semantic-distance knob offers a direct handle on exploration-exploitation balance that could be combined with other population-management techniques.
Load-bearing premise
A single transformer trained on millions of programs can generalize to symbolic regression problems that differ in the number of input dimensions.
What would settle it
Running TSGP on a set of symbolic regression problems whose input dimensions lie well outside the range seen during transformer training and observing whether accuracy or compactness collapses relative to baselines.
Original abstract
Transformer Semantic Genetic Programming (TSGP) is a semantic search approach that uses a pre-trained transformer model as a variation operator to generate offspring programs with high semantic similarity to a given parent. Unlike other semantic GP approaches that rely on fixed syntactic transformations, TSGP aims to learn diverse structural variations that lead to solutions with similar semantics. We find that a single transformer model trained on millions of programs is able to generalize across symbolic regression problems of varying dimension. Evaluated on 24 real-world and synthetic datasets, TSGP significantly outperforms standard GP, SLIM_GSGP, Deep Symbolic Regression, and Denoising Autoencoder GP, achieving an average rank of 1.58 across all benchmarks. Moreover, TSGP produces more compact solutions than SLIM_GSGP, despite its higher accuracy. In addition, the target semantic distance is able to effectively adjust the step size in the semantic space: small values enable consistent improvement in fitness but often lead to larger programs, while larger values promote faster convergence and compactness. Thus, the target semantic distance provides an effective mechanism for balancing exploration and exploitation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This paper introduces Transformer Semantic Genetic Programming (TSGP), a semantic GP approach that uses a pre-trained transformer model as a variation operator to generate offspring programs with high semantic similarity to a parent. The central claims are that a single transformer trained on millions of programs generalizes across symbolic regression problems of varying dimension d, that TSGP significantly outperforms standard GP, SLIM_GSGP, Deep Symbolic Regression, and Denoising Autoencoder GP on 24 real-world and synthetic datasets (average rank 1.58), produces more compact solutions than SLIM_GSGP, and that the target semantic distance parameter effectively balances exploration/exploitation by controlling step size in semantic space.
Significance. If the empirical claims hold under rigorous verification, this work would advance semantic genetic programming by demonstrating that learned transformer-based variation operators can outperform fixed syntactic or other learned baselines while yielding compact expressions. The reported ability of one model to handle varying d is noteworthy for symbolic regression applications if the training distribution is shown to cover the necessary range; the 24-dataset evaluation provides a broad test bed, and the explicit control via target semantic distance offers a practical tuning mechanism.
major comments (2)
- [Methods] Methods section on program generation and transformer pre-training: The generalization claim ('a single transformer model trained on millions of programs is able to generalize across symbolic regression problems of varying dimension') is load-bearing for the headline result. No details are provided on the distribution of d (number of variables) or arity in the training corpus; if generation fixes d or samples from a narrow band, performance on higher-d test problems cannot be attributed to learned cross-dimensional generalization and may instead reflect in-distribution behavior or per-problem adaptation.
- [Experiments / Results] Experimental results and Table reporting ranks: The average rank of 1.58 and statement of significant outperformance lack per-dataset ranks, standard deviations, error bars, or statistical tests (e.g., Friedman test with post-hoc Nemenyi or Wilcoxon signed-rank). Without these, the ranking claim across the 24 datasets cannot be fully assessed for robustness against benchmark selection or post-hoc choices.
minor comments (2)
- [Abstract] Abstract: The phrase 'significantly outperforms' should cross-reference the specific statistical procedure and threshold used in the main text for transparency.
- [Notation / Introduction] Notation: Clarify whether 'd' consistently denotes the number of input variables across training generation, problem statements, and results; add a brief definition on first use.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.
- Referee: [Methods] Methods section on program generation and transformer pre-training: The generalization claim ('a single transformer model trained on millions of programs is able to generalize across symbolic regression problems of varying dimension') is load-bearing for the headline result. No details are provided on the distribution of d (number of variables) or arity in the training corpus; if generation fixes d or samples from a narrow band, performance on higher-d test problems cannot be attributed to learned cross-dimensional generalization and may instead reflect in-distribution behavior or per-problem adaptation.
  Authors: We agree that explicit details on the training distribution are necessary to support the generalization claim. The original manuscript omitted a precise description of how d and arity were sampled when generating the pre-training corpus. In the revised version we will expand the Methods section to document the program generation procedure, including the range and sampling strategy for the number of variables and operator arities. This addition will allow readers to evaluate whether the observed performance on test problems of varying dimension reflects genuine cross-dimensional generalization.
  Revision: yes
- Referee: [Experiments / Results] Experimental results and Table reporting ranks: The average rank of 1.58 and statement of significant outperformance lack per-dataset ranks, standard deviations, error bars, or statistical tests (e.g., Friedman test with post-hoc Nemenyi or Wilcoxon signed-rank). Without these, the ranking claim across the 24 datasets cannot be fully assessed for robustness against benchmark selection or post-hoc choices.
  Authors: We concur that additional statistical detail would improve the transparency and robustness of the empirical results. The current manuscript reports only the average rank without per-dataset values or formal tests. In the revision we will add a table (or appendix) listing the rank of TSGP on each of the 24 datasets, include standard deviations or error bars for the performance metrics, and conduct a Friedman test with Nemenyi post-hoc analysis to substantiate the significance of the reported outperformance.
  Revision: yes
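For readers unfamiliar with the statistics requested here, the following sketch shows how an average rank and a Friedman statistic are computed from a datasets-by-methods table of errors. The table values are invented toy numbers, not results from the paper, and the statistic below omits the tie correction that library implementations usually apply.

```python
def ranks_with_ties(values):
    """Rank values ascending (1 = lowest error); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks

def average_ranks(table):
    """Mean rank of each method (column) over all datasets (rows)."""
    per_dataset = [ranks_with_ties(row) for row in table]
    n, k = len(table), len(table[0])
    return [sum(r[j] for r in per_dataset) / n for j in range(k)]

def friedman_statistic(table):
    """Friedman chi-square statistic, without tie correction."""
    n, k = len(table), len(table[0])
    avg = average_ranks(table)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in avg) - k * (k + 1) ** 2 / 4)

# Toy errors: 4 datasets (rows) x 3 methods (columns). Invented for illustration.
toy_errors = [
    [0.10, 0.20, 0.30],
    [0.15, 0.12, 0.40],
    [0.50, 0.70, 0.60],
    [0.20, 0.25, 0.25],
]

avg = average_ranks(toy_errors)
stat = friedman_statistic(toy_errors)
```

In practice one would likely use an established routine such as `scipy.stats.friedmanchisquare`, which also returns a p-value. The hand-rolled version is shown only to make the rank arithmetic behind a figure like "average rank 1.58" explicit.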
Circularity Check
No circularity: empirical performance claims rest on external benchmarks
The paper proposes TSGP as a semantic GP method that uses a pre-trained transformer for variation. Its central claims are empirical: a single model generalizes across dimensions and achieves average rank 1.58 on 24 external datasets, outperforming standard GP, SLIM_GSGP, Deep Symbolic Regression, and Denoising Autoencoder GP. No derivation, equation, or first-principles result is presented that reduces by construction to a fitted parameter, self-defined quantity, or self-citation chain. Performance is measured against independent baselines; the training-distribution coverage concern is an assumption about generalization, not a circular reduction of the reported result to the paper's own inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- target semantic distance
axioms (1)
- domain assumption: A transformer trained on millions of programs can produce structurally diverse offspring that preserve semantic similarity across varying problem dimensions
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "a single transformer model trained on millions of programs is able to generalize across symbolic regression problems of varying dimension... target semantic distance SD_t controls the step size in the semantic space"
- IndisputableMonolith/Foundation/DimensionForcing.lean, alexander_duality_circle_linking (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "We find that a single transformer model... generalize across... varying dimension"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.