arxiv: 2604.06814 · v1 · submitted 2026-04-08 · 💻 cs.LG · cs.AI

OmniTabBench: Mapping the Empirical Frontiers of GBDTs, Neural Networks, and Foundation Models for Tabular Data at Scale

Dihong Jiang, Ruoqi Cao, Zhiyuan Dang, Li Huang, Qingsong Zhang, Zhiyu Wang, Shihao Piao, Shenggao Zhu, Jianlong Chang, Zhouchen Lin, Qi Tian

Pith reviewed 2026-05-10 18:56 UTC · model grok-4.3

keywords: tabular data, benchmark, GBDT, neural networks, foundation models, meta-features, empirical study

The pith

No single model family wins on tabular tasks across 3030 datasets, but isolated metafeatures show when trees, networks or foundation models each perform best.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds OmniTabBench, the largest collection of tabular datasets to date, with 3030 datasets drawn from many sources and grouped by industry. It runs state-of-the-art GBDTs, neural networks and foundation models on every dataset and finds that no category beats the others overall. A new analysis looks at one dataset property at a time, such as size or feature skewness, and maps which property favors which model type. This approach gives clearer selection rules than earlier studies that mixed many properties together. The scale of the collection is meant to make the patterns more reliable than results from the usual handful of datasets.

Core claim

Evaluation of leading GBDTs, neural networks and foundation models on the 3030 datasets of OmniTabBench shows that no model category is dominant across all tasks, while a decoupled metafeature analysis of individual dataset properties identifies measurable conditions under which each category tends to excel.

What carries the argument

OmniTabBench, the 3030-dataset collection, together with the decoupled metafeature analysis that tests each dataset property separately rather than in compound scores.
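
One way to picture the decoupled analysis is to bin datasets along a single metafeature and compute, per bin, how often one model family outscores another. The sketch below is an illustration only and not the authors' code: the function name `win_rate_by_bin`, the record layout, the bin edges, and the toy scores are all assumptions made for the example.

```python
from bisect import bisect_right
from collections import defaultdict


def win_rate_by_bin(records, metafeature, edges, family_a, family_b):
    """Share of datasets per metafeature bin where family_a outscores family_b.

    records: list of dicts holding one metafeature value and one score per family.
    edges: ascending bin edges; bin 0 is below edges[0], bin i covers
           [edges[i-1], edges[i]), and the last bin is at or above edges[-1].
    """
    wins = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        bin_index = bisect_right(edges, rec[metafeature])
        gap = rec[family_a] - rec[family_b]  # positive gap means family_a wins
        totals[bin_index] += 1
        if gap > 0:
            wins[bin_index] += 1
    return {b: wins[b] / totals[b] for b in sorted(totals)}


# Toy, made-up records (hypothetical scores, not results from the paper).
toy = [
    {"n_rows": 500, "nn": 0.81, "gbdt": 0.79},
    {"n_rows": 800, "nn": 0.70, "gbdt": 0.74},
    {"n_rows": 20000, "nn": 0.88, "gbdt": 0.91},
    {"n_rows": 50000, "nn": 0.93, "gbdt": 0.90},
]
rates = win_rate_by_bin(toy, "n_rows", edges=[1000, 10000],
                        family_a="nn", family_b="gbdt")
print(rates)
```

Only one property varies per call, which is the point of decoupling: repeating the call with a different `metafeature` key gives a separate view per property rather than one compound score.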

If this is right

Model choice can be guided by a few easy-to-measure dataset traits instead of defaulting to one family.
Benchmarks must grow in size and property diversity to produce stable rankings.
Development of foundation models can target the specific dataset traits where they already show an edge.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

An automated selector could measure a few metafeatures at the start of a project and route the task to the favored model type.
Research attention may shift from seeking one universal tabular model toward lighter, property-aware ensembles.
The same decoupled analysis could be applied to new model families as they appear to keep the guidance current.
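
The automated selector mentioned in the first extension above could look roughly like the following. This is a hypothetical sketch: the `MetaFeatures` fields, the `route` function, the default family, and every threshold in `RULES` are invented placeholders and are not conditions reported by the paper.

```python
from dataclasses import dataclass


@dataclass
class MetaFeatures:
    n_rows: int
    n_cols: int
    categorical_fraction: float


def measure(rows, categorical_columns):
    """Compute a few cheap metafeatures from a list-of-lists table."""
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    frac = len(categorical_columns) / n_cols if n_cols else 0.0
    return MetaFeatures(n_rows, n_cols, frac)


def route(mf, rules, default="gbdt"):
    """Return the family of the first rule whose predicate matches, else the default."""
    for predicate, family in rules:
        if predicate(mf):
            return family
    return default  # hypothetical fallback, not a recommendation from the paper


# Placeholder rules: thresholds are invented for illustration only.
RULES = [
    (lambda mf: mf.n_rows < 1000, "foundation_model"),
    (lambda mf: mf.categorical_fraction > 0.5, "gbdt"),
    (lambda mf: mf.n_rows >= 1000, "neural_network"),
]

table = [[1.0, "a", 3.2], [0.5, "b", 1.1]]
mf = measure(table, categorical_columns=[1])
choice = route(mf, RULES)
print(mf, choice)
```

Keeping the rules as data rather than hard-coded branches would let them be refreshed whenever the analysis is rerun on new model families.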

Load-bearing premise

The 3030 datasets are representative of real tabular work and the metafeature analysis cleanly isolates favoring conditions without hidden interactions or selection bias.

What would settle it

Re-running the same models on the same 3030 datasets and splits and finding that one model family wins on a clear majority of tasks.
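
A minimal sketch of how such a check might be tallied, assuming a mapping from dataset to per-family scores where higher is better. The function names, the tie handling, and the 50% cutoff for a "clear majority" are assumptions for illustration; the scores are made up.

```python
from collections import Counter


def family_win_shares(scores_by_dataset):
    """Fraction of datasets on which each family is the sole top scorer."""
    wins = Counter()
    for scores in scores_by_dataset.values():
        best = max(scores.values())
        leaders = [family for family, s in scores.items() if s == best]
        if len(leaders) == 1:  # ties are counted as no win for anyone
            wins[leaders[0]] += 1
    n = len(scores_by_dataset)
    return {family: count / n for family, count in wins.items()}


def has_clear_majority(shares, threshold=0.5):
    """True if any family wins on strictly more than `threshold` of the datasets."""
    return any(share > threshold for share in shares.values())


# Hypothetical scores, not results from the paper.
toy_scores = {
    "d1": {"gbdt": 0.90, "nn": 0.85, "fm": 0.80},
    "d2": {"gbdt": 0.70, "nn": 0.75, "fm": 0.72},
    "d3": {"gbdt": 0.60, "nn": 0.61, "fm": 0.66},
    "d4": {"gbdt": 0.88, "nn": 0.80, "fm": 0.79},
}
shares = family_win_shares(toy_scores)
print(shares, has_clear_majority(shares))
```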

Figures

Figures reproduced from arXiv: 2604.06814 by Dihong Jiang, Jianlong Chang, Li Huang, Qingsong Zhang, Qi Tian, Ruoqi Cao, Shenggao Zhu, Shihao Piao, Zhiyuan Dang, Zhiyu Wang, Zhouchen Lin.

**Figure 1.** Workflow of constructing OmniTabBench. … 45 medium-sized dataset (∼10k). Both of them collect datasets solely from OpenML platform. Despite the existence of these benchmarks, many recent studies in tabular domain still did not choose them in the evaluation. Instead, they evaluate models on their own proposed/collected benchmarks (Gorishniy et al., 2021; 2024; Hollmann et al., 2025; Holzmüller et al., 2024…

**Figure 2.** Comparison between OmniTabBench and existing representative benchmarks. (a) We visualize the number of rows, columns, and the percentage of categorical columns per dataset in OmniTabBench, as well as their distributions. (b) TabZilla and TabArena contain notably fewer datasets than OmniTabBench.

**Figure 3.** Categorization of OmniTabBench by industries. … 4. Experiments. 4.1. Experimental Setup and Selected Models. Benchmarked Models: In light of the scale of our collected datasets, it is too costly to run experiments with many benchmarked models in an exhaustive manner as is completed in McElfresh et al. (2023); Erickson et al. (2025). Instead, we choose a few representative models in our experiments: (1) GBDTs: …

**Figure 4.** Rank of different models with increasing the number of evaluation datasets. The zoom-in window takes a closer look at the rank variation with a limited number of datasets. We plot this figure from datasets that all eight models have results. … on a subset (1815 datasets) of our benchmark, which is already 12× more than their evaluated real datasets. 4.2. Preprocessing. While TabPFN and many GBDT implementatio…

**Figure 5.** The distribution of performance gap between different pairs of models. Five columns represent five different metafeatures, and three rows denote three pairwise comparison. Performance gap on different datasets refers to the subtraction of the score between the former and latter models (for example, NN vs GBDT means subtracting score of GBDT from NN), which are quantified by red (positive/win) and blue (neg…

**Figure 6.** The distribution of performance gap between NNs and GBDT. Performance gap on different datasets refers to the subtraction of the score between the former and latter models (for example, NN vs GBDT means subtracting score of GBDT from NN), which are quantified by red (positive/win) and blue (negative/loss) points, respectively. We also fit a PDF of the points along each varying metafeatures via kernel densi…

**Figure 7.** The distribution of performance gap between TabPFN and NN. Panel x-axes: #Cols; #Rows; Ratio of #Rows to #Cols; Perce…

**Figure 8.** The distribution of performance gap between TabPFN and GBDT.
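
Figure 4 tracks how model rank changes as more datasets enter the evaluation. A minimal sketch of that kind of curve follows, assuming every model has a score on every dataset (the caption notes the figure uses only such datasets). The function names, the average-rank tie rule, the shuffling, and the toy scores are assumptions, not the paper's procedure.

```python
import random


def ranks(scores):
    """Rank models on one dataset: 1 is best; tied scores share the average rank."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    out = {}
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # positions i..j are tied
        for model in ordered[i:j + 1]:
            out[model] = average_rank
        i = j + 1
    return out


def running_mean_rank(per_dataset_scores, seed=0):
    """Mean rank of each model after 1, 2, ..., N datasets in a shuffled order."""
    order = list(per_dataset_scores)
    random.Random(seed).shuffle(order)
    totals = {}
    curve = []
    for k, name in enumerate(order, start=1):
        for model, r in ranks(per_dataset_scores[name]).items():
            totals[model] = totals.get(model, 0.0) + r
        curve.append({model: total / k for model, total in totals.items()})
    return curve


# Hypothetical scores, not results from the paper.
toy = {
    "d1": {"a": 0.9, "b": 0.8, "c": 0.7},
    "d2": {"a": 0.6, "b": 0.8, "c": 0.7},
    "d3": {"a": 0.9, "b": 0.9, "c": 0.5},
}
curve = running_mean_rank(toy)
print(curve[-1])
```

Early entries of `curve` depend on the shuffle order, which mirrors why rankings from a handful of datasets can be unstable; the final entry does not.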

Original abstract

While traditional tree-based ensemble methods have long dominated tabular tasks, deep neural networks and emerging foundation models have challenged this primacy, yet no consensus exists on a universally superior paradigm. Existing benchmarks typically contain fewer than 100 datasets, raising concerns about evaluation sufficiency and potential selection biases. To address these limitations, we introduce OmniTabBench, the largest tabular benchmark to date, comprising 3030 datasets spanning diverse tasks that are comprehensively collected from diverse sources and categorized by industry using large language models. We conduct an unprecedented large-scale empirical evaluation of state-of-the-art models from all model families on OmniTabBench, confirming the absence of a dominant winner. Furthermore, through a decoupled metafeature analysis, which examines individual properties such as dataset size, feature types, feature and target skewness/kurtosis, we elucidate conditions favoring specific model categories, providing clearer, more actionable guidance than prior compound-metric studies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

OmniTabBench scales tabular benchmarks to 3030 datasets and tries a decoupled metafeature breakdown, but the dataset collection and LLM labeling steps lack the validation needed to back the no-winner and condition-specific claims.

The core takeaway is that this work gives us a much larger collection of tabular datasets than prior benchmarks and attempts to link individual metafeatures like size, skewness, and feature types to model family performance. That scale and the move away from single overall scores are the parts worth paying attention to if the details hold up.

The evaluation covers GBDTs, neural nets, and foundation models across the full set and reports no single category dominating, which matches what many practitioners already observe in mixed real-world settings. The decoupled analysis is a reasonable step beyond compound metrics, and it could offer more usable pointers on when to prefer one approach over another.

The main weakness sits in the data pipeline. The abstract notes collection from diverse sources and LLM-driven industry categorization, yet there is no quantitative audit of inclusion rules, no human validation or agreement metrics for the labels, and no sensitivity checks on whether the metafeature regressions stay stable once interactions or collinearities are modeled. If the sampling frame or the decoupling step turns out to be sensitive to those choices, both the absence-of-winner result and the favoring-conditions tables become conditional on an untested distribution.

This paper is mainly for applied researchers and engineers who pick models for tabular tasks and want a broader empirical map than the usual small-benchmark papers supply. A reader focused on benchmark construction or metafeature studies would find the raw scale useful to examine. It deserves peer review because the size of the effort is substantial and the questions it raises are practical; referees can check the missing validation steps and ask for the necessary robustness tests without the work being dismissed outright.

Referee Report

2 major / 2 minor

Summary. The paper introduces OmniTabBench, the largest tabular benchmark to date with 3030 datasets collected from diverse sources and categorized by industry using LLMs. It reports a large-scale empirical evaluation of GBDTs, neural networks, and foundation models, finding no dominant winner across tasks. Through a decoupled metafeature analysis examining properties such as dataset size, feature types, and skewness/kurtosis, it identifies conditions favoring specific model categories, claiming clearer guidance than prior compound-metric studies.

Significance. If the central claims hold, the unprecedented scale provides a stronger empirical foundation than existing benchmarks limited to under 100 datasets, potentially shifting practice toward metafeature-guided model selection rather than defaulting to GBDTs. The decoupled analysis approach is a methodological strength that could yield actionable insights if validated. The work earns credit for its breadth of model families evaluated and the attempt to move beyond aggregate rankings, but its impact depends on addressing robustness concerns around data curation.

major comments (2)

[§3] §3 (Benchmark Construction and Dataset Collection): The central claims of no dominant winner and metafeature-derived conditions rest on the 3030 datasets being a minimally biased sample of real-world tabular tasks. However, the manuscript describes collection from 'diverse sources' and LLM-based industry labeling without providing a quantitative audit of inclusion/exclusion decisions, inter-annotator agreement metrics, human validation of LLM labels, or sensitivity checks showing that metafeature regressions remain stable when pairwise interactions or collinearities are modeled. This is load-bearing for the 'absence of winner' result and the 'conditions favoring' tables.
[§4] §4 (Experimental Evaluation): The no-winner conclusion and category comparisons require details on model implementations, hyperparameter search protocols, statistical controls for multiple testing, and error analysis. The abstract and methods summary do not specify these, leaving open whether performance differences are robust to implementation choices or data curation decisions, directly affecting the reliability of the decoupled metafeature mappings in §5.

minor comments (2)

[§5] Notation for metafeatures (e.g., skewness and kurtosis definitions) could be clarified with explicit formulas to aid reproducibility.
Figure captions for performance heatmaps should explicitly state the aggregation method (e.g., mean rank or win rate) to avoid ambiguity.
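
For reference on the first minor comment, the standard moment-based definitions are given below. This is an assumption about what such formulas would look like; the paper may use a different variant (for example a bias-corrected estimator, or excess rather than raw kurtosis).

```latex
\[
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
m_k = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^k,
\]
\[
\text{skewness} = \frac{m_3}{m_2^{3/2}}, \qquad
\text{kurtosis} = \frac{m_4}{m_2^{2}}, \qquad
\text{excess kurtosis} = \frac{m_4}{m_2^{2}} - 3 .
\]
```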

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify areas where additional transparency can strengthen the manuscript. We address each major point below and commit to revisions that enhance the documentation of our methods without altering the core empirical findings.

Referee: [§3] §3 (Benchmark Construction and Dataset Collection): The central claims of no dominant winner and metafeature-derived conditions rest on the 3030 datasets being a minimally biased sample of real-world tabular tasks. However, the manuscript describes collection from 'diverse sources' and LLM-based industry labeling without providing a quantitative audit of inclusion/exclusion decisions, inter-annotator agreement metrics, human validation of LLM labels, or sensitivity checks showing that metafeature regressions remain stable when pairwise interactions or collinearities are modeled. This is load-bearing for the 'absence of winner' result and the 'conditions favoring' tables.

Authors: We agree that explicit documentation of curation decisions is necessary to support the generalizability of our results. In the revised version, we will add a new subsection in §3 that lists all source repositories, states the precise inclusion criteria (e.g., minimum sample size, removal of duplicates and synthetic data), and reports the number of candidate datasets screened versus retained. For the LLM-based industry labels, we will perform and report human validation on a stratified sample of 150 datasets, including inter-annotator agreement (Cohen’s κ) between two independent annotators and the LLM. We will also augment the metafeature analysis in §5 with sensitivity regressions that explicitly include pairwise interaction terms and variance-inflation-factor diagnostics to verify that the reported condition-specific advantages remain stable. These additions directly address the load-bearing concerns while preserving the scale and diversity of the benchmark.
revision: yes
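
The agreement statistic named in this response, Cohen's κ, compares observed agreement with the agreement expected by chance. A minimal sketch for two raters follows; the function name and the toy industry labels are assumptions for illustration and do not come from the paper or the rebuttal.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for six datasets (made up for the example).
human = ["finance", "health", "health", "retail", "finance", "retail"]
llm = ["finance", "health", "retail", "retail", "finance", "health"]
kappa = cohens_kappa(human, llm)
print(round(kappa, 3))
```

With three raters (two annotators and the LLM), one option would be to report κ for each pair; that choice is also an assumption here.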
Referee: [§4] §4 (Experimental Evaluation): The no-winner conclusion and category comparisons require details on model implementations, hyperparameter search protocols, statistical controls for multiple testing, and error analysis. The abstract and methods summary do not specify these, leaving open whether performance differences are robust to implementation choices or data curation decisions, directly affecting the reliability of the decoupled metafeature mappings in §5.

Authors: We acknowledge that the main-text methods summary is currently concise and that key implementation details reside primarily in the appendix. We will expand §4 to include: (i) exact library versions and preprocessing pipelines for each model family, (ii) the hyperparameter search protocol (search spaces, optimizer, number of trials, and early-stopping rules), (iii) the statistical tests used for pairwise comparisons together with the multiple-testing correction applied, and (iv) a concise error-analysis subsection that examines systematic failure modes across model categories. These clarifications will be placed in the main text rather than the appendix, thereby making the robustness of the “no dominant winner” result and the subsequent metafeature mappings fully transparent.
revision: yes
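
The response does not say which multiple-testing correction would be applied. As one common possibility, the sketch below implements the Holm step-down adjustment; choosing Holm, the function name, and the p-values are all assumptions for illustration.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is scaled by the number of hypotheses left.
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for three pairwise comparisons (made up).
names = ["nn_vs_gbdt", "tabpfn_vs_nn", "tabpfn_vs_gbdt"]
raw = [0.010, 0.040, 0.030]
adjusted = dict(zip(names, holm_adjust(raw)))
print(adjusted)
```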

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark and analysis

The paper introduces OmniTabBench as a large-scale empirical collection of 3030 datasets and performs comparative model evaluations plus metafeature regressions to identify favoring conditions. No derivation chain, equations, or predictions are claimed; the central results (absence of dominant winner, conditions from decoupled metafeatures) are direct outputs of data collection and statistical analysis on observed properties. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided abstract or description. The work is self-contained as an empirical study without reducing any result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the representativeness of the collected datasets, the accuracy of LLM-driven categorization, and the validity of treating metafeatures as independent axes without hidden interactions.

axioms (2)

domain assumption: The 3030 datasets comprehensively represent diverse real-world tabular tasks across industries.
Invoked to support generalizability of the no-dominant-winner and metafeature findings.
ad hoc to paper: LLM-based categorization of datasets by industry is sufficiently accurate for analysis purposes.
Used to structure the benchmark and enable industry-level insights.

pith-pipeline@v0.9.0 · 5500 in / 1366 out tokens · 47966 ms · 2026-05-10T18:56:36.602811+00:00 · methodology

Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages

[1] Erickson, N., Purucker, L., Tschalzev, A., Holzmüller, D., Desai, P. M., Salinas, D., and Hutter, F. TabArena: A living benchmark for machine learning on tabular data. arXiv preprint arXiv:2506.16791, 2025.
[2] Kelly, M., Longjohn, R., and Nottingham, K. The UCI Machine Learning Repository. https://archive.ics.uci.edu. Kim, M. J., Grinsztajn, L., and Varoquaux, G. CARTE: pretraining and transfer for tabular learning. arXiv preprint arXiv:2402.16785, 2024.
[3] … to ensure reproducibility. In terms of ResNet and FT-Transformer, we adopt the implementations provided by Gorishniy et al. (2021) from the official repository (https://github.com/yandex-research/rtdl-revisiting-models), as these architectures represent the state-of-the-art in deep learning for tabular and structured data. For the RealMLP model, we utili...
[4] About Dataset. … with stratified sampling to preserve the original class distribution across both subsets. In the case of Regression tasks, the system incorporates a specialized preprocessing step to handle numerical strings, automatically removing characters such as commas and percentage signs to convert targets into floating-point numbers, before executing a standard 80/20 ...