arxiv: 2606.02384 · v1 · pith:TKCNASRLnew · submitted 2026-06-01 · 💻 cs.LG

TabPrep: Closing the Feature Engineering Gap in Tabular Benchmarks

Pith reviewed 2026-06-28 15:41 UTC · model grok-4.3

classification 💻 cs.LG
keywords tabular machine learning · feature engineering · preprocessing · benchmarks · TabPrep · model blind spots · automated feature engineering · TabArena

The pith

TabPrep shows that a targeted preprocessing pipeline closes the feature engineering gap in tabular benchmarks by raising performance across model classes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that modern tabular benchmarks omit feature engineering even though it remains essential in practice, creating an unmeasured gap between reported results and real pipelines. It presents TabPrep as a lightweight set of feature generators that address three recurring structural patterns in data. Models from several families exhibit consistent blind spots to these patterns, so the generators supply the missing signals. When the pipeline is added to training and tuning on the TabArena collection, accuracy rises for tree-based, neural, linear, and foundation models, and the lift frequently exceeds what new model architectures alone deliver. The approach also runs faster and applies more broadly than earlier automated feature-engineering tools, making it practical to include in large benchmark suites.

Core claim

TabPrep is a preprocessing pipeline built from feature generators that target three specific structural data patterns. Many widely used model classes display predictable blind spots to these patterns. Systematic application of the generators during training and hyperparameter tuning produces higher performance on the TabArena benchmark for tree-based, neural, linear, and foundation models, with the improvements often larger than those obtained from model-centric changes alone. The same pipeline also surpasses prior automated feature-engineering methods in accuracy, speed, and dataset coverage, allowing feature engineering to be added to future tabular evaluations.

What carries the argument

TabPrep, a lightweight preprocessing pipeline of feature generators that target three structural data patterns models commonly overlook.

If this is right

  • Tree-based, neural, linear, and foundation models all register measurable gains once TabPrep is inserted into training and tuning.
  • New peak results on tabular benchmarks can be reached through feature engineering without altering the underlying model architecture.
  • TabPrep runs with low enough overhead to be used inside large-scale benchmark evaluations.
  • The pipeline beats earlier automated feature engineering methods on accuracy, runtime, and breadth of applicable datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Standard tabular benchmarks could be extended to require or strongly encourage the use of such preprocessing so that reported rankings better reflect complete pipelines.
  • The three targeted patterns could be examined as a diagnostic checklist when new tabular datasets are released.
  • If the patterns prove general, practitioners might default to running TabPrep before testing complex models, shifting effort from architecture search to data preparation.
  • Researchers could measure how much of the current gap between benchmark and deployed performance disappears once feature generators of this form are included.

Load-bearing premise

The three structural data patterns the generators address are the main blind spots shared across model classes, and the observed gains will hold on data outside the TabArena collection.

What would settle it

A fresh collection of tabular datasets on which adding TabPrep produces no consistent accuracy gain or produces losses for the same model families would falsify the central performance claim.
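One way to make this check concrete is sketched below in Python. It assumes that per-dataset errors are available for the same model with and without the preprocessing step. The function names (`relative_gap`, `summarize`), the dataset names, and the numbers are hypothetical illustrations, not part of TabPrep or TabArena, and the relative-gap formula is only one reasonable choice of summary.

```python
def relative_gap(base_err, prep_err):
    """Percent by which the preprocessed error undercuts the baseline error.

    Positive means the preprocessing helped; negative means it hurt.
    """
    return 100.0 * (base_err - prep_err) / base_err


def summarize(errors):
    """Summarize paired results.

    errors: {dataset_name: (baseline_error, preprocessed_error)}, lower is better.
    """
    gaps = {name: relative_gap(b, p) for name, (b, p) in errors.items()}
    wins = sum(g > 0 for g in gaps.values())
    losses = sum(g < 0 for g in gaps.values())
    return {
        "mean_gain_pct": sum(gaps.values()) / len(gaps),
        "wins": wins,
        "losses": losses,
        "win_rate": wins / len(gaps),
    }


# Hypothetical errors on three held-out datasets (illustration only).
errors = {"d1": (0.20, 0.18), "d2": (0.50, 0.50), "d3": (0.10, 0.12)}
report = summarize(errors)
print(report)
```

On a fresh dataset collection, a mean gain near zero or a win rate no better than the loss rate for the same model families would be the kind of outcome that counts against the central claim. A proper study would add repeated splits and a significance test on top of this summary.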

Figures

Figures reproduced from arXiv: 2606.02384 by Andrej Tschalzev, Christian Bartelt, Heiner Stuckenschmidt, Huzefa Rangwala, Nick Erickson, Stefan Lüdtke, Yuyang Wang.

Figure 1
Illustration of how TabPrep exposes the peak performance gap missed by models. We zoom into the top of the leaderboard per dataset on the TabArena benchmark and compare recent model-centric improvements (TabPFN-2.6 [1], TabICLv2 [2]) along with our data-centric improvements from augmenting four models with TabPrep. Improvement is relative to the previous best out of 25 TabArena models tuned for 200 configu…
Figure 2
Overview of the three structural patterns targeted by TabPrep and the feature generators addressing them. The generator illustration focuses on the types of the generated features. The underlying procedures for feature filtering and prioritization under limited budgets are in Appendix C.
Figure 3
Performance of models augmented with TabPrep on the TabArena benchmark. Elo scores (higher is better) are averaged over datasets. Models labeled with Prep use TabPrep for data preprocessing, while all other results are taken from the official TabArena benchmark [26]. Left: Results for the full 51 TabArena datasets; Right: Results for the 15 largest datasets in the benchmark, between 10,000 and 100,000 trai…
Figure 4
Left: Relative improvement of TabPrep compared to autoFE libraries. Right: Relative improvement of each generator when sequentially added to a default LightGBM model.
Adjacent text from the paper (truncated at both ends): … memory or time limits on 33 and 22 datasets, respectively, whereas TabPrep runs on all datasets. Even on successful runs, the median single-model training-time increase is 180× for autofeat and 19× for OpenFE, compared to only 6× for TabPrep …
Figure 5
Improvability tuning trajectories. Points from left to right mark ensembles of increasing numbers of random configurations (1, 2, 5, 10, 25, 50, 100, 150, 201). The trajectories are sampled 20 times from all trials and averaged. Improvability (lower is better) measures how many percent lower the error of the best method is than the current method on a dataset, averaged over datasets.
Figure 6
Performance of models augmented with TabPrep on the small datasets of the TabArena benchmark. Elo scores (higher is better) are averaged over datasets comparing default models, tuned models, and tuned+ensembled variants. Models labeled with Prep use TabPrep for data preprocessing, while all other results are taken from the official TabArena benchmark [26]. Results on the 35 small TabArena datasets, below 1…
Figure 7
Performance of models augmented with TabPrep on the regression datasets of the TabArena benchmark. Elo scores (higher is better) are averaged over datasets comparing default models, tuned models, and tuned+ensembled variants. Models labeled with Prep use TabPrep for data preprocessing, while all other results are taken from the official TabArena benchmark [26].
Figure 8
Performance of models augmented with TabPrep on the binary datasets of the TabArena benchmark. Elo scores (higher is better) are averaged over datasets comparing default models, tuned models, and tuned+ensembled variants. Models labeled with Prep use TabPrep for data preprocessing, while all other results are taken from the official TabArena benchmark [26].
[Plot labels recovered from extraction: Linear, TorchMLP, RandomForest, CatBoost, TabM, LightG…]
Figure 9
Performance of models augmented with TabPrep on the multi-class datasets of the TabArena benchmark. Elo scores (higher is better) are averaged over datasets comparing default models, tuned models, and tuned+ensembled variants. Models labeled with Prep use TabPrep for data preprocessing, while all other results are taken from the official TabArena benchmark [26].
Figure 10
Pareto frontier of improvability and inference time. Following the TabArena benchmark, we report the median inference time per 1000 samples across all datasets.
[Plot text recovered from extraction: axis labels "Train time per 1K samples (s) (median)", "Inference time per 1K samples (s) (median)", and "Elo"; legend entries Optimal, PrepTabPFN-2.5, RealTabPFN-2.5, PrepLightGBM, LightGBM, PrepTabM, TabM, PrepLinear, Linear.]
Figure 11
Elo tuning trajectories for the 15 largest datasets in TabArena. Left: training-time tuning trajectories. Right: inference-time tuning trajectories. Points from left to right mark ensembles of increasing numbers of random configurations (1, 2, 5, 10, 25, 50, 100, 150, 201), sampled 20 times from all trials and averaged. The right-most highlighted points use all configurations. Higher is better for the Elo…
Figure 12
Pairwise win rate comparison for all datasets. Higher numbers correspond to a better win rate for the model on the y-axis.
[Plot text recovered from extraction: axis label "Improvement vs Default LightGBM"; series for 100, 500, 1000, and 2000 features.]
Figure 13
Sensitivity analysis of TabPrep improvements with varying the number of added features.
Figure 14
Analysis of sensitivity to the random seed in the feature generation of TabPrep. We examine seed sensitivity for a random batch of 20 TabArena datasets (first 20 when sorting names alphabetically). The blue curves correspond to the error distribution estimated from 200 LightGBM configurations without TabPrep. The red curves are the error distributions obtained with TabPrep under 5 different random seeds f…
Figure 15
Leave-one-out contribution of each preprocessor to performance.
Figure 16
Relative improvements of always using TabPrep vs. treating it as a hyperparameter in the search space. Left: PrepModels denotes results from a search space where TabPrep is used in each configuration. Right: (Prep)Models denotes results from a search space where non-augmented configurations are included. Both are compared to the base models in the tuned + ensembled regime (200 TabArena tuning configs).
Figure 17
Performance of LightGBM augmented with TabPrep on the TabArena benchmark. Left: Results on the first three outer splits of the TabArena benchmark. Right: Results on the full benchmark with 30 outer folds for small, and 9 for medium-sized datasets. Elo scores (higher is better) are averaged over datasets. Models labeled with Prep use TabPrep for data preprocessing, while all other results are taken from th…
Figure 18
Model comparison on data with algebraic structure. Each plot shows model error, where lower is better. We show the error for the default model and for variants augmented with arithmetic interactions up to order r using Ordered Arithmetic Feature Expansion with a budget of up to 2000 generated features. The displayed models are TabPFN-2.5 (PFN), TabM, LightGBM (GBM), and CatBoost (CAT). jungle represents a…
Figure 19
Model comparison on data with cross-categorical target effects (lower is better). We compare the models’ default categorical feature handling (None), out-of-fold target encoding (OOF-TE), appending OOF-TE while keeping the original categorical features (OOF-TE_APPEND), adding the Cartesian product of the categorical features (CATINT), and applying OOF-TE to categorical interactions (CATINT_OOFTE). The sim…
Figure 20
Model comparison on data with group-conditional numerical feature effects (lower is better). We compare the models’ default categorical feature handling (No FE), absolute GroupByThenMean interactions (GroupBy), and relative GroupByThenMean interactions obtained by subtracting the group mean from the base numerical feature (Diff-GroupBy). The simulated dataset has 1000 samples, and the regression target d…
Figure 21
Model comparison on data with group-conditional numerical feature effects (lower is better). We compare the models’ default categorical feature handling (No FE), absolute GroupByThenMean interactions (GroupBy), and GroupByThenMean interactions subtracted from the base numerical feature (Diff-GroupBy). The simulated dataset has 1000 samples and the regression target depends on 5 numerical features and one …
Figure 22
Model comparison with Random Subset Feature Compression. artificial-characters (left) is a dataset containing pseudo-categorical numerical features, while ada_agnostic (right) contains many low-cardinality features. We evaluate performance using log loss, where lower is better. None denotes the model without feature engineering. RSFC-noround-1-order denotes Random Subset Feature Compression with max_orde…
Figure 23
Performance comparison of TabPrep-augmented models and autoFE libraries. The left panel includes only datasets on which autofeat and OpenFE ran successfully. The right panel includes all datasets. Improvements are measured relative to the corresponding model without feature engineering, i.e., a linear model for autofeat and LightGBM for OpenFE.
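
The captions of Figures 18 to 21 name several kinds of generated features: arithmetic interactions between numeric features, Cartesian products of categorical features, and group means subtracted from a base numeric feature. The sketch below illustrates those three ideas in plain Python on a toy table. It is not TabPrep's implementation: the function names, column names, and data are assumptions made for illustration, only order-2 products are shown, and the filtering and budget logic that the paper places in Appendix C is omitted.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_products(row, numeric_cols):
    """Order-2 arithmetic interactions: one product per pair of numeric columns."""
    return {
        f"{a}*{b}": row[a] * row[b]
        for a, b in combinations(sorted(numeric_cols), 2)
    }


def cross_categories(rows, col_a, col_b):
    """Cartesian-product feature: join two categorical values into one label."""
    return [f"{r[col_a]}|{r[col_b]}" for r in rows]


def diff_from_group_mean(rows, group_col, value_col):
    """Relative group-by feature: each value minus the mean of its group."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in rows:
        sums[r[group_col]] += r[value_col]
        counts[r[group_col]] += 1
    return [
        r[value_col] - sums[r[group_col]] / counts[r[group_col]] for r in rows
    ]


# Toy table (hypothetical columns and values).
rows = [
    {"city": "A", "plan": "x", "income": 10.0, "age": 2.0},
    {"city": "A", "plan": "y", "income": 30.0, "age": 4.0},
    {"city": "B", "plan": "x", "income": 50.0, "age": 6.0},
]

print(pairwise_products(rows[0], ["income", "age"]))
print(cross_categories(rows, "city", "plan"))
print(diff_from_group_mean(rows, "city", "income"))
```

In a real pipeline the group statistics would be fitted on the training split only and then applied to validation and test rows, so that held-out data does not influence the features.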
Original abstract

Progress in tabular machine learning has largely focused on increasingly sophisticated model architectures. At the same time, feature engineering remains a critical yet underexplored component of real-world modeling pipelines that is entirely absent from modern benchmarks, which creates an unquantified evaluation gap. In this work, we introduce TabPrep, a lightweight preprocessing pipeline composed of feature generators that are carefully designed to target three specific structural data patterns. We show that many widely used model classes exhibit predictable blind spots to these patterns and that systematic feature engineering alone can establish new peak performance. Across the TabArena benchmark, integrating TabPrep into model training and tuning consistently improves performance for tree-based, neural, linear, and foundation models, often surpassing gains achieved by model-centric innovations alone. TabPrep outperforms previous automated feature engineering approaches in performance, efficiency, and applicability across datasets, enabling integration into large-scale benchmarks. By releasing TabPrep (see https://github.com/atschalz/tabprep), we enable researchers to integrate feature engineering into their benchmarking setup, filling a longstanding gap in tabular evaluations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces TabPrep, a lightweight preprocessing pipeline consisting of feature generators that target three specific structural data patterns in tabular data. The central claim is that widely used model classes (tree-based, neural, linear, and foundation models) exhibit predictable blind spots to these patterns, and that integrating TabPrep into training and tuning on the TabArena benchmark consistently improves performance, often surpassing gains from model-centric innovations alone. TabPrep is positioned as more efficient and broadly applicable than prior automated feature engineering methods, with open-source release to enable its use in large-scale benchmarks.

Significance. If the results hold, the work would be significant for tabular ML by providing a concrete, reproducible way to close the feature-engineering gap in benchmarks that currently focus almost exclusively on model architectures. The open-source release and emphasis on efficiency across model classes are strengths that could shift evaluation practices toward more realistic pipelines.

major comments (2)
  1. [§3] §3 (Pattern Selection and Generator Design): The manuscript provides no ablation or systematic justification for why these exact three structural patterns (rather than others) constitute the primary blind spots across model classes; without such evidence the claim that TabPrep systematically closes the benchmark gap does not follow from the TabArena results alone.
  2. [§5] §5 (TabArena Experiments): All reported gains are confined to the TabArena collection; no external datasets or cross-benchmark validation is presented to test whether the observed improvements generalize or whether equivalent gains could be obtained by generic preprocessing, which is load-bearing for the assertion that TabPrep establishes new peaks beyond model-centric advances.
minor comments (2)
  1. [Abstract] The abstract states performance improvements but contains no quantitative numbers, error bars, or dataset counts; adding a single sentence with key metrics would improve clarity.
  2. [§3] Notation for the three structural patterns is introduced without an explicit summary table; a small table listing each pattern, the corresponding generator, and the targeted model blind spot would aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. Below we respond point-by-point to the two major comments. Where the comments identify gaps that can be addressed by additional analysis or experiments, we commit to revisions in the next version of the manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (Pattern Selection and Generator Design): The manuscript provides no ablation or systematic justification for why these exact three structural patterns (rather than others) constitute the primary blind spots across model classes; without such evidence the claim that TabPrep systematically closes the benchmark gap does not follow from the TabArena results alone.

    Authors: The three patterns were chosen after reviewing prior literature on tabular data characteristics that are known to challenge standard model families (e.g., interaction effects, missing-value mechanisms, and scale heterogeneity). Preliminary experiments on a subset of TabArena datasets confirmed that these patterns produced measurable performance drops when left unaddressed. We acknowledge, however, that a systematic ablation comparing these patterns against plausible alternatives was not included. In the revised manuscript we will add such an ablation (both on pattern inclusion and on alternative generators) to provide the requested justification.
    Revision: yes

  2. Referee: [§5] §5 (TabArena Experiments): All reported gains are confined to the TabArena collection; no external datasets or cross-benchmark validation is presented to test whether the observed improvements generalize or whether equivalent gains could be obtained by generic preprocessing, which is load-bearing for the assertion that TabPrep establishes new peaks beyond model-centric advances.

    Authors: TabArena was selected because it is currently the largest and most standardized tabular benchmark that already controls for model tuning. Nevertheless, we agree that demonstrating generalization beyond this collection is important. In the revised manuscript we will report results on at least two additional public tabular datasets drawn from sources outside TabArena, together with a comparison against generic preprocessing baselines, to address the concern about external validity.
    Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation is self-contained

The paper introduces TabPrep as a preprocessing pipeline targeting three structural patterns and evaluates its impact via direct performance comparisons on the TabArena benchmark across model classes. All load-bearing claims (performance gains, outperformance of prior automated FE) are grounded in external empirical results rather than any derivation, equation, or self-citation that reduces to the paper's own inputs by construction. No self-definitional steps, fitted-input predictions, or uniqueness theorems appear; the work is a standard empirical contribution whose validity can be checked against the released code and benchmark data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no free parameters, axioms, or invented entities are described.

Reference graph

Works this paper leans on

131 extracted references · 34 canonical work pages · 8 internal anchors

  1. [1]

    Tabpfn-2.6

    Prior Labs. Tabpfn-2.6. https://huggingface.co/Prior-Labs/tabpfn_2_6, 2025. Hugging Face model card, version v1.0

  2. [2]

    Tabiclv2: A better, faster, scalable, and open tabular foundation model

    Jingang Qu, David Holzmüller, Gaël Varoquaux, and Marine Le Morvan. Tabiclv2: A better, faster, scalable, and open tabular foundation model. arXiv preprint arXiv:2602.11139, 2026

  3. [3]

    TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second

    Noah Hollmann, Samuel Müller, Katharina Eggensperger, and Frank Hutter. Tabpfn: A transformer that solves small tabular classification problems in a second. arXiv preprint arXiv:2207.01848, 2022

  4. [4]

    Grande: Gradient-based decision tree ensembles for tabular data

    Sascha Marton, Stefan Lüdtke, Christian Bartelt, and Heiner Stuckenschmidt. Grande: Gradient-based decision tree ensembles for tabular data. arXiv preprint arXiv:2309.17130, 2023

  5. [5]

    Better by default: Strong pre-tuned mlps and boosted trees on tabular data

    David Holzmüller, Léo Grinsztajn, and Ingo Steinwart. Better by default: Strong pre-tuned mlps and boosted trees on tabular data. Advances in Neural Information Processing Systems, 37:26577–26658, 2024

  6. [6]

    Tabm: Advancing tabular deep learning with parameter-efficient ensembling

    Yury Gorishniy, Akim Kotelnikov, and Artem Babenko. Tabm: Advancing tabular deep learning with parameter-efficient ensembling. arXiv preprint arXiv:2410.24210, 2024

  7. [7]

    Enabling mixed effects neural networks for diverse, clustered data using monte carlo methods

    Andrej Tschalzev, Paul Nitschke, Lukas Kirchdorfer, Stefan Lüdtke, Christian Bartelt, and Heiner Stuckenschmidt. Enabling mixed effects neural networks for diverse, clustered data using monte carlo methods. arXiv preprint arXiv:2407.01115, 2024

  8. [8]

    Tabdpt: Scaling tabular foundation models

    Junwei Ma, Valentin Thomas, Rasa Hosseinzadeh, Hamidreza Kamkari, Alex Labach, Jesse C Cresswell, Keyvan Golestan, Guangwei Yu, Maksims Volkovs, and Anthony L Caterini. Tabdpt: Scaling tabular foundation models. arXiv preprint arXiv:2410.18164, 2024

  9. [9]

    Modern neighborhood components analysis: A deep tabular baseline two decades later

    Han-Jia Ye, Huai-Hong Yin, and De-Chuan Zhan. Modern neighborhood components analysis: A deep tabular baseline two decades later. arXiv preprint arXiv:2407.03257, 2024

  10. [10]

    TabICL: A Tabular Foundation Model for In-Context Learning on Large Data

    Jingang Qu, David Holzmüller, Gaël Varoquaux, and Marine Le Morvan. Tabicl: A tabular foundation model for in-context learning on large data. arXiv preprint arXiv:2502.05564, 2025

  11. [11]

    TabPFN-2.5: Advancing the State of the Art in Tabular Foundation Models

    Léo Grinsztajn, Klemens Flöge, Oscar Key, Felix Birkel, Philipp Jund, Brendan Roof, Benjamin Jäger, Dominik Safaric, Simone Alessi, Adrian Hayler, et al. Tabpfn-2.5: Advancing the state of the art in tabular foundation models. arXiv preprint arXiv:2511.08667, 2025

  12. [12]

    Limix: Unleashing structured-data modeling capability for generalist intelligence

    Xingxuan Zhang, Gang Ren, Han Yu, Hao Yuan, Hui Wang, Jiansheng Li, Jiayun Wu, Lang Mo, Li Mao, Mingchao Hao, et al. Limix: Unleashing structured-data modeling capability for generalist intelligence. arXiv preprint arXiv:2509.03505, 2025

  13. [13]

    Mitra: Mixed synthetic priors for enhancing tabular foundation models

    Xiyuan Zhang, Danielle C Maddix, Junming Yin, Nick Erickson, Abdul Fatir Ansari, Boran Han, Shuai Zhang, Leman Akoglu, Christos Faloutsos, Michael W Mahoney, et al. Mitra: Mixed synthetic priors for enhancing tabular foundation models. arXiv preprint arXiv:2510.21204, 2025

  14. [14]

    Contexttab: A semantics-aware tabular in-context learner

    Marco Spinaci, Marek Polewczyk, Maximilian Schambach, and Sam Thelin. Contexttab: A semantics-aware tabular in-context learner. arXiv preprint arXiv:2506.10707, 2025

  15. [15]

    Pytorch tabular: A framework for deep learning with tabular data

    Manu Joseph. Pytorch tabular: A framework for deep learning with tabular data. arXiv preprint arXiv:2104.13638, 2021

  16. [16]

    On the efficiency of nlp-inspired methods for tabular deep learning

    Anton Frederik Thielmann and Soheila Samiee. On the efficiency of nlp-inspired methods for tabular deep learning. arXiv preprint arXiv:2411.17207, 2024

  17. [17]

    Large scale transfer learning for tabular data via language modeling

    Joshua P Gardner, Juan Carlos Perdomo, and Ludwig Schmidt. Large scale transfer learning for tabular data via language modeling. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  18. [18]

    A data-centric perspective on evaluating machine learning models for tabular data

    Andrej Tschalzev, Sascha Marton, Stefan Lüdtke, Christian Bartelt, and Heiner Stuckenschmidt. A data-centric perspective on evaluating machine learning models for tabular data. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024

  19. [19]

    Unreflected use of tabular data repositories can undermine research quality

    Andrej Tschalzev, Lennart Purucker, Stefan Lüdtke, Frank Hutter, Christian Bartelt, and Heiner Stuckenschmidt. Unreflected use of tabular data repositories can undermine research quality. In The Future of Machine Learning Data Practices and Repositories at ICLR 2025, 2025

  20. [20]

    2023 kaggle ai report

    Bojan Tunguz, Dieter, Heads or Tails, Karnika Kapoor, Parul Pandey, Paul Mooney, Phil Culliton, Rob Mulla, Sanyam Bhutani, and Will Cukierski. 2023 kaggle ai report, 2023. URL https://kaggle.com/competitions/2023-kaggle-ai-report

  21. [21]

    Tabred: Analyzing pitfalls and filling the gaps in tabular deep learning benchmarks

    Ivan Rubachev, Nikolay Kartashev, Yury Gorishniy, and Artem Babenko. Tabred: Analyzing pitfalls and filling the gaps in tabular deep learning benchmarks. arXiv preprint arXiv:2406.19380, 2024

  22. [22]

    Openfe: Automated feature generation with expert-level performance

    Tianping Zhang, Zheyu Aqa Zhang, Zhiyuan Fan, Haoyan Luo, Fengyuan Liu, Qian Liu, Wei Cao, and Li Jian. Openfe: Automated feature generation with expert-level performance. In International Conference on Machine Learning, pages 41880–41901. PMLR, 2023

  23. [23]

    The autofeat python library for automated feature engineering and selection

    Franziska Horn, Robert Pack, and Michael Rieger. The autofeat python library for automated feature engineering and selection. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 111–120. Springer, 2019

  24. [24]

    How usable is automated feature engineering for tabular data?

    Bastian Schäfer, Lennart Purucker, Maciej Janowski, and Frank Hutter. How usable is automated feature engineering for tabular data? arXiv preprint arXiv:2508.13932, 2025

  25. [25]

    Accurate predictions on small data with a tabular foundation model

    N. Hollmann, S. Müller, L. Purucker, A. Krishnakumar, M. Körfer, Shi Bin Hoo, Robin Tibor Schirrmeister, and Frank Hutter. Accurate predictions on small data with a tabular foundation model. Nature, 637(8045):319–326, 2025

  26. [26]

    TabArena: A Living Benchmark for Machine Learning on Tabular Data

    Nick Erickson, Lennart Purucker, Andrej Tschalzev, David Holzmüller, Prateek Mutalik Desai, David Salinas, and Frank Hutter. Tabarena: A living benchmark for machine learning on tabular data. arXiv preprint arXiv:2506.16791, 2025

  27. [27]

    Why do tree-based models still outperform deep learning on typical tabular data?

    Léo Grinsztajn, Edouard Oyallon, and Gaël Varoquaux. Why do tree-based models still outperform deep learning on typical tabular data? Advances in neural information processing systems, 35:507–520, 2022

  28. [28]

    When do neural nets outperform boosted trees on tabular data?

    Duncan McElfresh, Sujay Khandagale, Jonathan Valverde, Vishak Prasad C, Ganesh Ramakrishnan, Micah Goldblum, and Colin White. When do neural nets outperform boosted trees on tabular data? Advances in Neural Information Processing Systems, 36:76336–76369, 2023

  29. [29]

    Openml: Insights from 10 years and more than a thousand papers

    Bernd Bischl, Giuseppe Casalicchio, Taniya Das, Matthias Feurer, Sebastian Fischer, Pieter Gijsbers, Subhaditya Mukherjee, Andreas C Müller, László Németh, Luis Oala, et al. Openml: Insights from 10 years and more than a thousand papers. Patterns, 2025

  30. [31]

    Openml benchmarking suites and the openml100

    B. Bischl, G. Casalicchio, M. Feurer, F. Hutter, M. Lang, R. Mantovani, J. N. van Rijn, and J. Vanschoren. Openml benchmarking suites and the openml100. arXiv:1708.03731v1 [stat.ML], 2019

  31. [32]

    Revisiting deep learning models for tabular data

    Yury Gorishniy, Ivan Rubachev, Valentin Khrulkov, and Artem Babenko. Revisiting deep learning models for tabular data. Advances in neural information processing systems, 34:18932–18943, 2021

  32. [33]

    Assembled-openml: creating efficient benchmarks for ensembles in automl with openml

    Lennart Purucker and Joeran Beel. Assembled-openml: creating efficient benchmarks for ensembles in automl with openml. arXiv preprint arXiv:2307.00285, 2023

  33. [34]

    Benchmarking distribution shift in tabular data with tableshift

    Josh Gardner, Zoran Popovic, and Ludwig Schmidt. Benchmarking distribution shift in tabular data with tableshift. Advances in Neural Information Processing Systems, 36:53385–53432, 2023

  34. [35]

    Tabrepo: A large scale repository of tabular model evaluations and its automl applications

    David Salinas and Nick Erickson. Tabrepo: A large scale repository of tabular model evaluations and its automl applications. In AutoML Conference 2024 (ABCD Track), 2024

  35. [36]

    Towards quantifying the effect of datasets for benchmarking: A look at tabular machine learning

    Ravin Kohli, Matthias Feurer, Katharina Eggensperger, Bernd Bischl, and Frank Hutter. Towards quantifying the effect of datasets for benchmarking: A look at tabular machine learning. In ICLR Workshop, volume 2, page 6, 2024

  36. [37]

    A comprehensive benchmark of machine and deep learning across diverse tabular datasets. arXiv preprint arXiv:2408.14817, 2024

    Assaf Shmuel, Oren Glickman, and Teddy Lazebnik. A comprehensive benchmark of machine and deep learning across diverse tabular datasets. arXiv preprint arXiv:2408.14817, 2024

  37. [38]

    Is deep learning finally better than decision trees on tabular data? arXiv preprint arXiv:2402.03970, 2024

    Guri Zabërgja, Arlind Kadra, Christian Frey, and Josif Grabocka. Is deep learning finally better than decision trees on tabular data? arXiv preprint arXiv:2402.03970, 2024

  38. [39]

    Scikit-learn: Machine learning in Python

    F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011

  39. [40]

    Feature-engine: A python package for feature engineering for machine learning

    Soledad Galli. Feature-engine: A python package for feature engineering for machine learning. Journal of Open Source Software, 6(65):3642, 2021

  40. [41]

    Category encoders: a scikit-learn-contrib package of transformers for encoding categorical data. Journal of Open Source Software, 3(21):501, 2018

    William D McGinnis, Chapman Siu, Hanyu Huang, et al. Category encoders: a scikit-learn-contrib package of transformers for encoding categorical data. Journal of Open Source Software, 3(21):501, 2018

  41. [42]

    Special issue on feature engineering editorial. Machine learning, 113(7):3917–3928, 2024

    Tim Verdonck, Bart Baesens, María Óskarsdóttir, and Seppe vanden Broucke. Special issue on feature engineering editorial. Machine learning, 113(7):3917–3928, 2024

  42. [43]

    Climb: Class-imbalanced learning benchmark on tabular data. arXiv preprint arXiv:2505.17451, 2025

    Zhining Liu, Zihao Li, Ze Yang, Tianxin Wei, Jian Kang, Yada Zhu, Hendrik Hamann, Jingrui He, and Hanghang Tong. Climb: Class-imbalanced learning benchmark on tabular data. arXiv preprint arXiv:2505.17451, 2025

  43. [44]

    Imputation for prediction: beware of diminishing returns. arXiv preprint arXiv:2407.19804, 2024

    Marine Le Morvan and Gaël Varoquaux. Imputation for prediction: beware of diminishing returns. arXiv preprint arXiv:2407.19804, 2024

  44. [45]

    AutoCross: Automatic Feature Crossing for Tabular Data in Real-World Applications

    Luo Yuanfei, Wang Mengshuo, Zhou Hao, Yao Quanming, Tu WeiWei, Chen Yuqiang, Yang Qiang, and Dai Wenyuan. Autocross: Automatic feature crossing for tabular data in real-world applications. arXiv preprint arXiv:1904.12857, 2019

  45. [46]

    Deep feature synthesis: Towards automating data science endeavors

    James Max Kanter and Kalyan Veeramachaneni. Deep feature synthesis: Towards automating data science endeavors. In 2015 IEEE international conference on data science and advanced analytics (DSAA), pages 1–10. IEEE, 2015

  46. [47]

    Automated data science for relational data

    Hoang Thanh Lam, Beat Buesser, Hong Min, Tran Ngoc Minh, Martin Wistuba, Udayan Khurana, Gregory Bramble, Theodoros Salonidis, Dakuo Wang, and Horst Samulowitz. Automated data science for relational data. In 2021 IEEE 37th International Conference on Data Engineering (ICDE), pages 2689–2692. IEEE, 2021

  47. [48]

    Generalized and heuristic-free feature construction for improved accuracy

    Wei Fan, Erheng Zhong, Jing Peng, Olivier Verscheure, Kun Zhang, Jiangtao Ren, Rong Yan, and Qiang Yang. Generalized and heuristic-free feature construction for improved accuracy. In Proceedings of the 2010 SIAM International Conference on Data Mining, pages 629–640. SIAM, 2010

  48. [49]

    Safe: Scalable automatic feature engineering framework for industrial tasks

    Qitao Shi, Ya-Lin Zhang, Longfei Li, Xinxing Yang, Meng Li, and Jun Zhou. Safe: Scalable automatic feature engineering framework for industrial tasks. In 2020 IEEE 36th International Conference on Data Engineering (ICDE), pages 1645–1656. IEEE, 2020

  49. [50]

    Learning feature engineering for classification

    Fatemeh Nargesian, Horst Samulowitz, Udayan Khurana, Elias B Khalil, and Deepak S Turaga. Learning feature engineering for classification. In IJCAI, volume 17, pages 2529–2535, 2017

  50. [51]

    Bioautoml: automated feature engineering and metalearning to predict noncoding rnas in bacteria. Briefings in Bioinformatics, 23(4):bbac218, 2022

    Robson P Bonidia, Anderson P Avila Santos, Breno LS de Almeida, Peter F Stadler, Ulisses N da Rocha, Danilo S Sanches, and André CPLF de Carvalho. Bioautoml: automated feature engineering and metalearning to predict noncoding rnas in bacteria. Briefings in Bioinformatics, 23(4):bbac218, 2022

  51. [52]

    Adafs: Adaptive feature selection in deep recommender system

    Weilin Lin, Xiangyu Zhao, Yejing Wang, Tong Xu, and Xian Wu. Adafs: Adaptive feature selection in deep recommender system. In Proceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining, pages 3309–3317, 2022

  52. [53]

    An introduction to variable and feature selection. Journal of machine learning research, 3(Mar):1157–1182, 2003

    Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. Journal of machine learning research, 3(Mar):1157–1182, 2003

  53. [54]

    Hybrid bag-of-visual-words and featurewiz selection for content-based visual information retrieval

    Samy Bakheet, Ayoub Al-Hamadi, Emadeldeen Soliman, and Mohamed Heshmat. Hybrid bag-of-visual-words and featurewiz selection for content-based visual information retrieval. Sensors, 23(3):1653, 2023

  54. [55]

    Feature selection for high-dimensional data: A fast correlation-based filter solution

    Lei Yu and Huan Liu. Feature selection for high-dimensional data: A fast correlation-based filter solution. In Proceedings of the 20th international conference on machine learning (ICML-03), pages 856–863, 2003

  55. [56]

    Feature selection with the boruta package. Journal of statistical software, 36:1–13, 2010

    Miron B Kursa and Witold R Rudnicki. Feature selection with the boruta package. Journal of statistical software, 36:1–13, 2010

  56. [57]

    Concrete autoencoders: Differentiable feature selection and reconstruction

    Muhammed Fatih Balın, Abubakar Abid, and James Zou. Concrete autoencoders: Differentiable feature selection and reconstruction. In International conference on machine learning, pages 444–453. PMLR, 2019

  57. [58]

    Feature selection using stochastic gates

    Yutaro Yamada, Ofir Lindenbaum, Sahand Negahban, and Yuval Kluger. Feature selection using stochastic gates. In International conference on machine learning, pages 10648–10659. PMLR, 2020

  58. [59]

    A tutorial-based survey on feature selection: Recent advancements on feature selection. Engineering applications of artificial intelligence, 126:107136, 2023

    Amir Moslemi. A tutorial-based survey on feature selection: Recent advancements on feature selection. Engineering applications of artificial intelligence, 126:107136, 2023

  59. [60]

    A performance-driven benchmark for feature selection in tabular deep learning. Advances in Neural Information Processing Systems, 36:41956–41979, 2023

    Valeriia Cherepanova, Roman Levin, Gowthami Somepalli, Jonas Geiping, C Bayan Bruss, Andrew G Wilson, Tom Goldstein, and Micah Goldblum. A performance-driven benchmark for feature selection in tabular deep learning. Advances in Neural Information Processing Systems, 36:41956–41979, 2023

  60. [61]

    Analysis and comparison of feature selection methods towards performance and stability. Expert Systems with Applications, 249:123667, 2024

    Matheus Cezimbra Barbieri, Bruno Iochins Grisci, and Márcio Dorn. Analysis and comparison of feature selection methods towards performance and stability. Expert Systems with Applications, 249:123667, 2024

  61. [62]

    Noah Hollmann, Samuel Müller, and Frank Hutter. Large language models for automated data science: Introducing caafe for context-aware automated feature engineering. Advances in Neural Information Processing Systems, 36:44753–44775, 2023

  62. [63]

    LLM-FE: Automated Feature Engineering for Tabular Data with LLMs as Evolutionary Optimizers

    Nikhil Abhyankar, Parshin Shojaee, and Chandan K Reddy. Llm-fe: Automated feature engineering for tabular data with llms as evolutionary optimizers. arXiv preprint arXiv:2503.14434, 2025

  63. [64]

    Large language models engineer too many simple features for tabular data. arXiv preprint arXiv:2410.17787, 2024

    Jaris Küken, Lennart Purucker, and Frank Hutter. Large language models engineer too many simple features for tabular data. arXiv preprint arXiv:2410.17787, 2024

  64. [65]

    Elephants never forget: Memorization and learning of tabular data in large language models

    Sebastian Bordt, Harsha Nori, Vanessa Cristiny Rodrigues Vasconcelos, Besmira Nushi, and Rich Caruana. Elephants never forget: Memorization and learning of tabular data in large language models. In First Conference on Language Modeling, 2024

  65. [66]

    Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989

    Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989

  66. [67]

    T. M. Mitchell. The need for biases in learning generalizations. Technical report, Computer Science Department, Rutgers University, New Brunswick, NJ, 1980

  67. [68]

    A model of inductive bias learning. Journal of artificial intelligence research, 12:149–198, 2000

    Jonathan Baxter. A model of inductive bias learning. Journal of artificial intelligence research, 12:149–198, 2000

  68. [69]

    Neural arithmetic logic units. Advances in neural information processing systems, 31, 2018

    Andrew Trask, Felix Hill, Scott E Reed, Jack Rae, Chris Dyer, and Phil Blunsom. Neural arithmetic logic units. Advances in neural information processing systems, 31, 2018

  69. [70]

    Neural arithmetic units. arXiv preprint arXiv:2001.05016, 2020

    Andreas Madsen and Alexander Rosenberg Johansen. Neural arithmetic units. arXiv preprint arXiv:2001.05016, 2020

  70. [71]

    Investigating the limitations of transformers with simple arithmetic tasks. arXiv preprint arXiv:2102.13019, 2021

    Rodrigo Nogueira, Zhiying Jiang, and Jimmy Lin. Investigating the limitations of transformers with simple arithmetic tasks. arXiv preprint arXiv:2102.13019, 2021

  71. [72]

    Mixed effects models and extensions in ecology with R

    Alain F Zuur, Elena N Ieno, Neil J Walker, Anatoly A Saveliev, Graham M Smith, et al. Mixed effects models and extensions in ecology with R, volume 574. Springer, 2009

  72. [73]

    lme4: Mixed-effects modeling with r, 2010

    Douglas M Bates. lme4: Mixed-effects modeling with r, 2010

  73. [74]

    Linear mixed-effects model

    Andrzej Gałecki and Tomasz Burzykowski. Linear mixed-effects model. In Linear mixed-effects models using R: a step-by-step approach, pages 245–273. Springer, 2012

  74. [75]

    Regularized target encoding outperforms traditional methods in supervised machine learning with high cardinality features

    Florian Pargent, Florian Pfisterer, Janek Thomas, and Bernd Bischl. Regularized target encoding outperforms traditional methods in supervised machine learning with high cardinality features. Computational Statistics, 37(5):2671–2692, 2022

  75. [76]

    Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural information processing systems, 30, 2017

    Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural information processing systems, 30, 2017

  76. [77]

    Catboost: unbiased boosting with categorical features. Advances in neural information processing systems, 31, 2018

    Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. Catboost: unbiased boosting with categorical features. Advances in neural information processing systems, 31, 2018

  77. [78]

    A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. ACM SIGKDD explorations newsletter, 3(1):27–32, 2001

    Daniele Micci-Barreca. A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. ACM SIGKDD explorations newsletter, 3(1):27–32, 2001

  78. [79]

    Neural Oblivious Decision Ensembles for Deep Learning on tabular data

    S. Popov, S. Morozov, and A. Babenko. Neural Oblivious Decision Ensembles for Deep Learning on tabular data. arXiv:1909.06312v2 [cs.LG], 2019

  79. [80]

    AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data

    N. Erickson, J. Mueller, A. Shirkov, H. Zhang, P. Larroy, M. Li, and A. Smola. Autogluon-tabular: Robust and accurate automl for structured data. arXiv:2003.06505 [stat.ML], 2020

  80. [81]

    L. Breiman. Random forests. Machine Learning, 45:5–32, 2001
