LLM-FE: Automated Feature Engineering for Tabular Data with LLMs as Evolutionary Optimizers

Chandan K. Reddy; Nikhil Abhyankar; Parshin Shojaee

arxiv: 2503.14434 · v3 · submitted 2025-03-18 · cs.LG · cs.AI · cs.CL · cs.NE

Pith reviewed 2026-05-22 23:34 UTC · model grok-4.3

keywords automated feature engineering · large language models · evolutionary search · tabular data · program synthesis · machine learning

The pith

LLM-FE treats feature engineering for tabular data as an evolutionary program search guided by LLM proposals and validation feedback.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that combining large language models with evolutionary search overcomes the limits of fixed transformation spaces and one-shot prompting in automated feature engineering. LLMs propose new feature transformation programs in each generation, while data-driven performance scores select and refine the population over iterations. A sympathetic reader would expect this to produce features that improve downstream model accuracy on classification and regression tasks more reliably than baselines that lack iterative reasoning or domain knowledge. The central mechanism is the closed loop between LLM generation and empirical fitness signals.

Core claim

LLM-FE formulates feature engineering as a program search problem in which large language models iteratively propose feature transformation programs, and validation scores from the downstream tabular model supply the selection pressure that evolves higher-performing programs across generations, yielding consistent gains over state-of-the-art baselines on diverse classification and regression benchmarks.

What carries the argument

Evolutionary search loop in which LLMs serve as the variation operators that generate candidate feature transformation programs and validation performance ranks them for retention and further refinement.
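
The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `propose` is a hypothetical stand-in for the LLM call (it only perturbs the weights of a sampled parent), `validation_score` is a toy fitness that thresholds the new feature instead of retraining a downstream tabular model, and the population size and generation count are arbitrary.

```python
import random
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]
Program = Callable[[Row], float]      # one row in, one new feature value out
Weights = Tuple[float, float]         # toy "program source": weights on columns a and b
Scored = Tuple[float, Weights]        # (validation score, program source)


def make_program(weights: Weights) -> Program:
    """Turn the toy program source into an executable feature transformation."""
    wa, wb = weights
    return lambda row: wa * row["a"] + wb * row["b"]


def validation_score(program: Program, rows: List[Row], labels: List[int]) -> float:
    """Toy fitness (assumption): accuracy of thresholding the feature at its mean."""
    values = [program(row) for row in rows]
    threshold = sum(values) / len(values)
    predictions = [1 if v > threshold else 0 for v in values]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def propose(parents: List[Scored], rng: random.Random) -> Weights:
    """Hypothetical stand-in for the LLM: vary a parent drawn from the population."""
    _, (wa, wb) = rng.choice(parents)
    return (wa + rng.gauss(0, 0.5), wb + rng.gauss(0, 0.5))


def evolve(rows, labels, generations=20, population_size=4, seed=0):
    rng = random.Random(seed)
    start: Weights = (1.0, 1.0)
    population: List[Scored] = [(validation_score(make_program(start), rows, labels), start)]
    history = []
    for _ in range(generations):
        candidate = propose(population, rng)                                  # variation
        score = validation_score(make_program(candidate), rows, labels)      # evaluation
        population.append((score, candidate))
        population.sort(key=lambda item: item[0], reverse=True)              # ranking
        population = population[:population_size]                            # retention
        history.append(population[0][0])
    return population[0], history


# Synthetic data: the label depends on a - b, which the starting program a + b misses.
data_rng = random.Random(1)
rows = [{"a": data_rng.uniform(-1, 1), "b": data_rng.uniform(-1, 1)} for _ in range(200)]
labels = [1 if r["a"] > r["b"] else 0 for r in rows]
best, history = evolve(rows, labels)
print(best, history[-1])
```

Because the best individual is always retained, the best score in `history` never decreases; whether it actually rises depends on how informative the proposals are, which is the premise the review goes on to question.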

If this is right

Tabular models obtain higher predictive performance on standard benchmarks without requiring hand-crafted features.
The search process incorporates dataset-specific domain knowledge that static operator libraries cannot supply.
Feature discovery adapts over multiple rounds using concrete performance data rather than relying on a single LLM call.
Both classification and regression tasks benefit from the same iterative program-evolution procedure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same evolutionary prompting pattern could be applied to other program-synthesis settings such as data preprocessing pipelines or model architecture search.
Generated features might serve as interpretable artifacts that reveal which data relationships the LLM has surfaced.
Efficiency gains could come from caching high-performing program fragments across related datasets rather than restarting evolution each time.

Load-bearing premise

Validation scores supply a sufficiently clean and informative ranking signal so that LLM-proposed programs improve over generations rather than being dominated by invalid or unhelpful outputs.

What would settle it

On a held-out collection of tabular classification and regression datasets, LLM-FE produces no statistically significant accuracy lift relative to the strongest fixed-space or non-evolutionary LLM baseline after multiple independent runs.
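
One way to run such a check is a paired sign-flip permutation test on per-dataset score differences. The sketch below is an assumption about how the comparison could be done, not a procedure taken from the paper; the function name `paired_permutation_pvalue` is hypothetical and the accuracies are made-up numbers for illustration only.

```python
import random
from typing import Sequence


def paired_permutation_pvalue(
    method: Sequence[float],
    baseline: Sequence[float],
    n_resamples: int = 10_000,
    seed: int = 0,
) -> float:
    """Two-sided p-value for the mean paired difference under random sign flips."""
    diffs = [m - b for m, b in zip(method, baseline)]
    observed = abs(sum(diffs) / len(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis each paired difference is equally likely to have either sign.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if abs(flipped) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Made-up per-dataset accuracies, for illustration only (not results from the paper).
with_features = [0.86, 0.79, 0.91, 0.74, 0.88, 0.83, 0.77, 0.90, 0.81, 0.85]
baseline = [0.84, 0.78, 0.89, 0.73, 0.85, 0.82, 0.75, 0.88, 0.80, 0.82]
p_value = paired_permutation_pvalue(with_features, baseline)
print(p_value)
```

In practice each entry would itself be an average over several independent runs on one dataset, so that run-to-run variance is not mistaken for a difference between methods.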

Figures

Figures reproduced from arXiv: 2503.14434 by Chandan K. Reddy, Nikhil Abhyankar, Parshin Shojaee.

**Figure 1.** Overview of the LLM-FE Framework. For a given dataset, LLM-FE follows these steps: (a) New Feature Generation, where an LLM generates feature transformation hypotheses as programs for a given tabular dataset; (b) Feature Engineering, where the feature transformation program is applied to the underlying dataset, resulting in a modified dataset; (c) Feature Evaluation, where the modified dataset with the new…

**Figure 2.** Aggregated ablation study results across…

**Figure 3.** Qualitative Analysis on Impact of Domain Knowledge.

**Figure 4.** Quantitative impact of domain knowledge on model accuracy. Using domain knowledge boosts performance compared to both the base model and LLM-FE without domain knowledge. (Adjacent plot: validation accuracy against iterations, 0 to 20, for LLM-FE and LLM-FE w/o Evolutionary Refinement.)

**Figure 6.** Example of an input prompt for balance-scale dataset.

**Figure 7.** An example of the alternate set of instructions.

**Figure 8.** Impact of Noise Levels on XGBoost model performance across different feature engineering approaches, under increasing noise conditions (σ = 0.0 to 0.1). We report the mean accuracy across six classification datasets containing only numerical features.

**Figure 9.** Pareto Plot: comparing trade-off between performance (accuracy) vs time (in seconds) for LLM-FE and other feature engineering baselines. Automated feature engineering methods, both classical and LLM-based, universally employ model training and validation to evaluate feature relevance. This evaluation strategy represents standard methodology across all automated feature engineering approaches rather than an…

**Figure 10.** Frequency of Feature Engineering Operators.

**Figure 11.** Quantitative and Qualitative Analysis on Impact of Domain Knowledge for LLM-FE on Heart.

**Figure 12.** Detailed performance trajectory of LLM-FE compared with its ‘w/o Evolutionary Refinement’ variant on PC1 and Balance-Scale datasets. The graph demonstrates that LLM-FE, using evolutionary search, consistently improves validation accuracy, while the non-refinement variant stagnates due to local optima. On the PC1 dataset, the non-refinement variant plateaus after seven iterations, and on the Bala…

Original abstract

Automated feature engineering plays a critical role in improving predictive model performance for tabular learning tasks. Traditional automated feature engineering methods are limited by their reliance on pre-defined transformations within fixed, manually designed search spaces, often neglecting domain knowledge. Recent advances using Large Language Models (LLMs) have enabled the integration of domain knowledge into the feature engineering process. However, existing LLM-based approaches use direct prompting or rely solely on validation scores for feature selection, failing to leverage insights from prior feature discovery experiments or establish meaningful reasoning between feature generation and data-driven performance. To address these challenges, we propose LLM-FE, a novel framework that combines evolutionary search with the domain knowledge and reasoning capabilities of LLMs to automatically discover effective features for tabular learning tasks. LLM-FE formulates feature engineering as a program search problem, where LLMs propose new feature transformation programs iteratively, and data-driven feedback guides the search process. Our results demonstrate that LLM-FE consistently outperforms state-of-the-art baselines, significantly enhancing the performance of tabular prediction models across diverse classification and regression benchmarks. The code is available at: https://github.com/nikhilsab/LLMFE

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LLM-FE adds an evolutionary loop to LLM-driven feature program search for tabular data, but the absence of any numbers on proposal validity, error rates, or actual performance gains leaves the central claim unsupported.

The main takeaway is that this paper frames feature engineering as an iterative program search where LLMs propose transformations and validation scores guide selection across generations. It tries to fix earlier LLM methods that either prompt once or select only by score without using history or reasoning about past attempts. The approach is straightforward and targets a real pain point in tabular pipelines where manual features still dominate. Releasing the code is useful for anyone who wants to inspect or extend the implementation.

The formulation as program search lets the LLM draw on domain knowledge for things like custom aggregations or interactions that fixed libraries miss. That part is a reasonable step forward from direct prompting.

The soft spot is exactly the one the stress test flags. If a large share of LLM outputs are syntactically invalid, non-executable, or produce constant or duplicate features, the validation scores supply little useful signal and the evolutionary loop adds little beyond random search or the base model. The abstract and description give no figures on proposal success rate, how errors are filtered or repaired, or ablation results that isolate the evolutionary component. Without those, the strong claim of consistent outperformance over state-of-the-art baselines cannot be evaluated.

The paper is aimed at applied ML practitioners working with tabular classification and regression who already use AutoML tools and want to test whether LLM reasoning can reduce manual work. It could also interest people studying LLM-guided optimization more broadly.

I would send it for peer review. The idea is clear enough and the practical motivation is sound; referees can check whether the experiments actually demonstrate that the evolutionary signal works and whether the reported gains hold up under proper controls and multiple runs.

Referee Report

2 major / 1 minor

Summary. The paper proposes LLM-FE, a framework that casts automated feature engineering for tabular data as an evolutionary program search problem. LLMs iteratively propose feature transformation programs, with data-driven validation scores providing the selection signal to guide the search and incorporate domain knowledge. The central claim is that this approach consistently outperforms state-of-the-art baselines and improves downstream tabular prediction performance on diverse classification and regression benchmarks.

Significance. If the empirical results and the reliability of the LLM-driven evolutionary loop hold, the work could meaningfully advance automated feature engineering by moving beyond fixed, manually designed transformation spaces to leverage LLM reasoning in an iterative, feedback-guided manner. The public code release is a positive contribution for reproducibility.

major comments (2)

[Method (program search formulation)] The central claim that LLM-proposed programs yield reliable evolutionary improvement depends on validation scores supplying directional signal. The method description states that LLMs propose programs iteratively and data-driven feedback guides the search, yet no quantitative results are supplied on proposal validity rate, syntax/runtime error frequency, or the fraction of programs filtered before scoring. If a substantial fraction of proposals are non-executable or produce degenerate features, the ranking step supplies little signal and observed gains could be attributable to the base learner rather than the LLM-evolution loop.
[Abstract] Abstract asserts that LLM-FE 'consistently outperforms state-of-the-art baselines' and 'significantly enhancing the performance' across benchmarks, but supplies no quantitative results, baseline names, dataset list, statistical tests, or ablation details on the evolutionary component versus random search or direct prompting. Without these, the data-to-claim link cannot be evaluated.
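
The statistics requested in the first comment are cheap to collect if each candidate is classified before it is scored. The following is a rough sketch under assumptions: the candidate interface (a function named `new_feature` that takes a column dictionary) and the failure categories are hypothetical and not taken from the paper or its code release, and real untrusted LLM output should be executed in a sandbox rather than with a bare `exec`.

```python
from collections import Counter
from typing import Dict, List

Table = Dict[str, List[float]]  # column name -> column values


def check_program(source: str, table: Table) -> str:
    """Classify one candidate program as 'valid' or by its failure mode."""
    try:
        code = compile(source, "<candidate>", "exec")
    except SyntaxError:
        return "syntax_error"
    namespace: dict = {}
    try:
        exec(code, namespace)  # assumption: trusted toy input; sandbox real LLM output
        feature = namespace["new_feature"](table)
    except Exception:
        return "runtime_error"
    n_rows = len(next(iter(table.values())))
    if not isinstance(feature, list) or len(feature) != n_rows:
        return "malformed"
    if not all(isinstance(v, (int, float)) for v in feature):
        return "malformed"
    if len(set(feature)) <= 1:
        return "constant"        # carries no information for the downstream model
    if any(feature == column for column in table.values()):
        return "duplicate"       # copies an existing column
    return "valid"


table = {"a": [1.0, 2.0, 3.0], "b": [2.0, 4.0, 8.0]}
candidates = [
    "def new_feature(t): return [x / y for x, y in zip(t['a'], t['b'])]",
    "def new_feature(t) return t['a']",
    "def new_feature(t): return [x / 0 for x in t['a']]",
    "def new_feature(t): return [1.0 for _ in t['a']]",
    "def new_feature(t): return list(t['a'])",
]
tally = Counter(check_program(source, table) for source in candidates)
validity_rate = tally["valid"] / len(candidates)
print(dict(tally), validity_rate)
```

Reporting the tally per generation, alongside the validation scores, would show how much of the search budget reaches the ranking step at all.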

minor comments (1)

[Abstract / Method] The abstract and method overview would benefit from a concise diagram or pseudocode of the evolutionary loop (proposal, execution, scoring, selection) to clarify the exact feedback mechanism.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point-by-point below and will revise the manuscript to strengthen the empirical grounding of the claims.

Referee: [Method (program search formulation)] The central claim that LLM-proposed programs yield reliable evolutionary improvement depends on validation scores supplying directional signal. The method description states that LLMs propose programs iteratively and data-driven feedback guides the search, yet no quantitative results are supplied on proposal validity rate, syntax/runtime error frequency, or the fraction of programs filtered before scoring. If a substantial fraction of proposals are non-executable or produce degenerate features, the ranking step supplies little signal and observed gains could be attributable to the base learner rather than the LLM-evolution loop.

Authors: We agree that the current manuscript does not report these intermediate statistics on proposal validity and error rates. In the revised version we will add a dedicated analysis (new subsection or appendix) that quantifies, across the experimental runs, the fraction of LLM-proposed programs that are syntactically valid and executable, the frequency of runtime errors, and the proportion filtered before scoring. This will directly demonstrate that validation scores supply meaningful selection signal. All reported gains are measured against baselines that employ identical base learners on the same data splits, which already isolates the contribution of the LLM-driven evolutionary loop from the base model itself. revision: yes
Referee: [Abstract] Abstract asserts that LLM-FE 'consistently outperforms state-of-the-art baselines' and 'significantly enhancing the performance' across benchmarks, but supplies no quantitative results, baseline names, dataset list, statistical tests, or ablation details on the evolutionary component versus random search or direct prompting. Without these, the data-to-claim link cannot be evaluated.

Authors: The abstract is intentionally concise and high-level. The full manuscript supplies the requested details: baseline names and implementations, the complete dataset list, performance tables with statistical significance tests, and ablations that compare the evolutionary loop against random search and direct-prompting variants. To improve the abstract-to-claim linkage we will revise the abstract to include a short quantitative statement (e.g., average relative improvement and number of benchmarks) while preserving length constraints. revision: partial

Circularity Check

0 steps flagged

No circularity; empirical method with external validation

The paper describes an LLM-driven evolutionary search for feature programs, guided by validation scores on held-out data. No equations, fitted parameters, self-definitional constructs, or load-bearing self-citations appear in the abstract or method outline. Performance claims rest on benchmark comparisons rather than any internal reduction to the method's own inputs. This is the common case of a self-contained empirical framework.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review; the central claim rests on the unexamined premise that LLM-generated programs plus validation feedback produce net improvement. No free parameters, invented entities, or additional axioms are visible.

axioms (1)

domain assumption: Large language models encode useful domain knowledge that can be elicited to propose feature transformations.
The method description relies on LLMs supplying domain-informed proposals rather than random or template-based ones.

pith-pipeline@v0.9.0 · 5748 in / 1179 out tokens · 53255 ms · 2026-05-22T23:34:08.956506+00:00 · methodology

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Memory-Augmented LLM-based Multi-Agent System for Automated Feature Generation on Tabular Data
cs.AI 2026-04 unverdicted novelty 7.0

MALMAS is a memory-augmented multi-agent LLM system that generates diverse, high-quality features for tabular data via agent decomposition, routing, and iterative memory-guided refinement.

BoostLLM: Boosting-inspired LLM Fine-tuning for Few-shot Tabular Classification
cs.LG 2026-05 unverdicted novelty 6.0

BoostLLM trains sequential PEFT adapters as weak learners in a residual process, using decision-tree paths as a second input view, to improve few-shot tabular classification over standard LLM fine-tuning and match or ...

BoostLLM: Boosting-inspired LLM Fine-tuning for Few-shot Tabular Classification
cs.LG 2026-05 unverdicted novelty 6.0

BoostLLM trains sequential PEFT adapters in a boosting framework with tree path inputs to improve LLM performance on few-shot tabular classification, matching or exceeding XGBoost.

TriAlignGR: Triangular Multitask Alignment with Multimodal Deep Interest Mining for Generative Recommendation
cs.IR 2026-05 unverdicted novelty 6.0

TriAlignGR proposes a triangular multitask alignment framework with cross-modal semantic alignment, deep interest mining via chain-of-thought, and joint training on eight tasks to address content degradation and seman...

FELA: A Multi-Agent Evolutionary System for Feature Engineering of Industrial Event Log Data
cs.AI 2025-10 unverdicted novelty 6.0

FELA deploys specialized LLM agents in an evolutionary framework to generate, validate, and refine explainable features from heterogeneous industrial event logs, improving downstream model performance.

RelAgent: LLM Agents as Data Scientists for Relational Learning
cs.LG 2026-05 unverdicted novelty 5.0

RelAgent uses an LLM agent to autonomously generate SQL feature programs paired with classical models for interpretable relational learning predictions that execute efficiently on standard databases.

TriAlignGR: Triangular Multitask Alignment with Multimodal Deep Interest Mining for Generative Recommendation
cs.IR 2026-05 unverdicted novelty 5.0

TriAlignGR integrates visual content and latent user interests into Semantic IDs via cross-modal alignment, CoT-based interest mining, and triangular multitask training to address content degradation and semantic opac...

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · cited by 5 Pith papers · 5 internal anchors

[1]

Optuna: A next-generation hyperparameter optimization framework

Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pages 2623–2631, 2019

work page 2019
[2]

The state of data science 2020

Anaconda. The state of data science 2020. Website, 2020

work page 2020
[3]

Uci machine learning repository, 2007

Arthur Asuncion, David Newman, et al. Uci machine learning repository, 2007

work page 2007
[4]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

work page 1901
[5]

Evoprompting: language models for code-level neural architecture search

Angelica Chen, David Dohan, and David So. Evoprompting: language models for code-level neural architecture search. Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[6]

Xgboost: A scalable tree boosting system

Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages 785–794, 2016

work page 2016
[7]

Interpretable Machine Learning for Science with PySR and SymbolicRegression.jl

Miles Cranmer. Interpretable machine learning for science with pysr and symbolicregression. jl. arXiv preprint arXiv:2305.01582, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

Increased flexibility in genetic algorithms: The use of variable boltzmann selective pressure to control propagation

Michael De La Maza and Bruce Tidor. Increased flexibility in genetic algorithms: The use of variable boltzmann selective pressure to control propagation. In Computer Science and Operations Research, pages 425–440. Elsevier, 1992

work page 1992
[9]

Lift: Language-interfaced fine-tuning for non-language machine learning tasks

Tuan Dinh, Yuchen Zeng, Ruisu Zhang, Ziqian Lin, Michael Gira, Shashank Rajput, Jy-yong Sohn, Dimitris Papailiopoulos, and Kangwook Lee. Lift: Language-interfaced fine-tuning for non-language machine learning tasks. Advances in Neural Information Processing Systems, 35:11763–11784, 2022

work page 2022
[10]

A few useful things to know about machine learning

Pedro Domingos. A few useful things to know about machine learning. Communications of the ACM, 55(10):78–87, 2012

work page 2012
[11]

The Llama 3 Herd of Models

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

Openml-python: an extensible python api for openml

Matthias Feurer, Jan N Van Rijn, Arlind Kadra, Pieter Gijsbers, Neeratyoy Mallik, Sahithya Ravi, Andreas Müller, Joaquin Vanschoren, and Frank Hutter. Openml-python: an extensible python api for openml. Journal of Machine Learning Research, 22(100):1–5, 2021

work page 2021
[13]

Revisiting deep learning models for tabular data.Advances in Neural Information Processing Systems, 34:18932– 18943, 2021

Yury Gorishniy, Ivan Rubachev, Valentin Khrulkov, and Artem Babenko. Revisiting deep learning models for tabular data.Advances in Neural Information Processing Systems, 34:18932– 18943, 2021

work page 2021
[14]

Why do tree-based models still outperform deep learning on typical tabular data? Advances in neural information processing systems, 35:507–520, 2022

Léo Grinsztajn, Edouard Oyallon, and Gaël Varoquaux. Why do tree-based models still outperform deep learning on typical tabular data? Advances in neural information processing systems, 35:507–520, 2022. 10

work page 2022
[15]

EvoPrompt: Connecting LLMs with Evolutionary Algorithms Yields Powerful Prompt Optimizers

Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, and Yujiu Yang. Connecting large language models with evolutionary algorithms yields powerful prompt optimizers. arXiv preprint arXiv:2309.08532, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[16]

Language models can teach themselves to program better

Patrick Haluptzok, Matthew Bowers, and Adam Tauman Kalai. Language models can teach themselves to program better. arXiv preprint arXiv:2207.14502, 2022

work page arXiv 2022
[17]

Large language models can automatically engineer features for few-shot tabular learning

Sungwon Han, Jinsung Yoon, Sercan O Arik, and Tomas Pfister. Large language models can automatically engineer features for few-shot tabular learning. arXiv preprint arXiv:2404.09491, 2024

work page arXiv 2024
[18]

Tabllm: Few-shot classification of tabular data with large language models

Stefan Hegselmann, Alejandro Buendia, Hunter Lang, Monica Agrawal, Xiaoyi Jiang, and David Sontag. Tabllm: Few-shot classification of tabular data with large language models. In International Conference on Artificial Intelligence and Statistics, pages 5549–5581. PMLR, 2023

work page 2023
[19]

TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second

Noah Hollmann, Samuel Müller, Katharina Eggensperger, and Frank Hutter. Tabpfn: A transformer that solves small tabular classification problems in a second. arXiv preprint arXiv:2207.01848, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[20]

Large language models for automated data science: Introducing caafe for context-aware automated feature engineering

Noah Hollmann, Samuel Müller, and Frank Hutter. Large language models for automated data science: Introducing caafe for context-aware automated feature engineering. Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[21]

The autofeat python library for automated feature engineering and selection

Franziska Horn, Robert Pack, and Michael Rieger. The autofeat python library for automated feature engineering and selection. In Machine Learning and Knowledge Discovery in Databases: International Workshops of ECML PKDD 2019, Würzburg, Germany, September 16–20, 2019, Proceedings, Part I, pages 111–120. Springer, 2020

work page 2019
[22]

Deep feature synthesis: Towards automating data science endeavors

James Max Kanter and Kalyan Veeramachaneni. Deep feature synthesis: Towards automating data science endeavors. In 2015 IEEE international conference on data science and advanced analytics (DSAA), pages 1–10. IEEE, 2015

work page 2015
[23]

Feature engineering for predictive modeling using reinforcement learning

Udayan Khurana, Horst Samulowitz, and Deepak Turaga. Feature engineering for predictive modeling using reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018

work page 2018
[24]

Cognito: Automated feature engineering for supervised learning

Udayan Khurana, Deepak Turaga, Horst Samulowitz, and Srinivasan Parthasrathy. Cognito: Automated feature engineering for supervised learning. In 2016 IEEE 16th international conference on data mining workshops (ICDMW), pages 1304–1307. IEEE, 2016

work page 2016
[25]

Large language models engineer too many simple features for tabular data

Jaris Küken, Lennart Purucker, and Frank Hutter. Large language models engineer too many simple features for tabular data. arXiv preprint arXiv:2410.17787, 2024

work page arXiv 2024
[26]

Large language models as evolution strategies

Robert Lange, Yingtao Tian, and Yujin Tang. Large language models as evolution strategies. In Proceedings of the Genetic and Evolutionary Computation Conference Companion, pages 579–582, 2024

work page 2024
[27]

Evolution through large models

Joel Lehman, Jonathan Gordon, Shawn Jain, Kamal Ndousse, Cathy Yeh, and Kenneth O Stanley. Evolution through large models. In Handbook of Evolutionary Machine Learning, pages 331–366. Springer, 2023

work page 2023
[28]

Large language models to enhance bayesian optimization

Tennison Liu, Nicolás Astorga, Nabeel Seedat, and Mihaela van der Schaar. Large language models to enhance bayesian optimization. arXiv preprint arXiv:2402.03921, 2024

work page arXiv 2024
[29]

Self-refine: Iterative refinement with self-feedback

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[30]

Language model crossover: Variation through few-shot prompting

Elliot Meyerson, Mark J Nelson, Herbie Bradley, Adam Gaier, Arash Moradi, Amy K Hoover, and Joel Lehman. Language model crossover: Variation through few-shot prompting. ACM Transactions on Evolutionary Learning, 4(4):1–40, 2024

work page 2024
[31]

Optimized feature generation for tabular data via llms with decision tree reasoning

Jaehyun Nam, Kyuyoung Kim, Seunghyuk Oh, Jihoon Tack, Jaehyung Kim, and Jinwoo Shin. Optimized feature generation for tabular data via llms with decision tree reasoning. arXiv preprint arXiv:2406.08527, 2024

work page arXiv 2024
[32]

Stunt: Few-shot tabular learning with self-generated tasks from unlabeled tables

Jaehyun Nam, Jihoon Tack, Kyungmin Lee, Hankook Lee, and Jinwoo Shin. Stunt: Few-shot tabular learning with self-generated tasks from unlabeled tables. arXiv preprint arXiv:2303.00918, 2023. 11

work page arXiv 2023
[33]

Learning feature engineering for classification

Fatemeh Nargesian, Horst Samulowitz, Udayan Khurana, Elias B Khalil, and Deepak S Turaga. Learning feature engineering for classification. In Ijcai, volume 17, pages 2529–2535, 2017

work page 2017
[34]

GPT-4 Technical Report

R OpenAI. Gpt-4 technical report. arxiv 2303.08774. View in Article, 2(5), 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[35]

Mathematical discoveries from program search with large language models

Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M Pawan Kumar, Emilien Dupont, Francisco JR Ruiz, Jordan S Ellenberg, Pengming Wang, Omar Fawzi, et al. Mathematical discoveries from program search with large language models. Nature, 625(7995):468–475, 2024

work page 2024
[36]

LLM-SR: Scientific Equation Discovery via Programming with Large Language Models

Parshin Shojaee, Kazem Meidani, Shashank Gupta, Amir Barati Farimani, and Chandan K Reddy. Llm-sr: Scientific equation discovery via programming with large language models. arXiv preprint arXiv:2404.18400, 2024

work page arXiv 2024
[37]

Openml: networked science in machine learning

Joaquin Vanschoren, Jan N Van Rijn, Bernd Bischl, and Luis Torgo. Openml: networked science in machine learning. ACM SIGKDD Explorations Newsletter, 15(2):49–60, 2014

work page 2014
[38]

Attention is all you need

A Vaswani. Attention is all you need. Advances in Neural Information Processing Systems, 2017

work page 2017
[39]

Anypredict: Foundation model for tabular prediction

Zifeng Wang, Chufan Gao, Cao Xiao, and Jimeng Sun. Anypredict: Foundation model for tabular prediction. CoRR, 2023

work page 2023
[40]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

work page 2022
[41]

Evolutionary computation in the era of large language model: Survey and roadmap

Xingyu Wu, Sheng-hao Wu, Jibin Wu, Liang Feng, and Kay Chen Tan. Evolutionary computation in the era of large language model: Survey and roadmap. arXiv preprint arXiv:2401.10034, 2024

work page arXiv 2024
[42]

Making pre-trained language models great on tabular prediction

Jiahuan Yan, Bo Zheng, Hongxia Xu, Yiheng Zhu, Danny Z Chen, Jimeng Sun, Jian Wu, and Jintai Chen. Making pre-trained language models great on tabular prediction. arXiv preprint arXiv:2403.01841, 2024

work page arXiv 2024
[43]

Le, Denny Zhou, and Xinyun Chen

Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V . Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers, 2024

[44]

Automatic feature engineering by deep reinforcement learning

Jianyu Zhang, Jianye Hao, Françoise Fogelman-Soulié, and Zan Wang. Automatic feature engineering by deep reinforcement learning. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pages 2312–2314, 2019

[45]

Openfe: automated feature generation with expert-level performance

Tianping Zhang, Zheyu Aqa Zhang, Zhiyuan Fan, Haoyan Luo, Fengyuan Liu, Qian Liu, Wei Cao, and Li Jian. Openfe: automated feature generation with expert-level performance. In International Conference on Machine Learning, pages 41880–41901. PMLR, 2023

[46]

Can GPT-4 Perform Neural Architecture Search?, August 2023

Mingkai Zheng, Xiu Su, Shan You, Fei Wang, Chen Qian, Chang Xu, and Samuel Albanie. Can gpt-4 perform neural architecture search? arXiv preprint arXiv:2304.10970, 2023

[47]

Large language models can learn rules

Zhaocheng Zhu, Yuan Xue, Xinyun Chen, Denny Zhou, Jian Tang, Dale Schuurmans, and Hanjun Dai. Large language models can learn rules. arXiv preprint arXiv:2310.07064, 2023.

Impact Statement

The introduction of LLM-FE as a framework for leveraging LLMs in automated feature engineering has the potential to significantly impact the field of machine learning b...

