TabArena: A Living Benchmark for Machine Learning on Tabular Data
Pith reviewed 2026-05-19 08:38 UTC · model grok-4.3
pith:OHUAB6VQ Add to your LaTeX paper
What is a Pith Number?\usepackage{pith}
\pithnumber{OHUAB6VQ}
Prints a linked pith:OHUAB6VQ badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more
The pith
TabArena launches as a continuously updated benchmark showing that ensembles across tabular models set new performance records while exposing overfitting in some deep learning approaches.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TabArena is a living tabular benchmarking system that initializes a public leaderboard through manual curation of representative datasets and well-implemented models, followed by large-scale experiments. These experiments show that ensembles across different models improve upon single-model or same-model results to advance the state of the art in tabular machine learning. Gradient-boosted trees remain strong, deep learning catches up under larger time budgets when ensembled, and foundation models excel on smaller datasets, while some deep models overfit validation sets and become overrepresented in cross-model ensembles.
What carries the argument
TabArena, the continuously maintained tabular benchmarking system with curated datasets, models, validation protocols, ensembling procedures, and maintenance team that enables ongoing leaderboard updates.
If this is right
- Validation method choice and ensembling of hyperparameter configurations are required to measure any model's full potential on tabular tasks.
- Cross-model ensembles consistently outperform both individual models and within-model ensembles on the curated collection.
- Deep learning models require additional safeguards against validation-set overfitting to participate reliably in cross-model ensembles.
- Foundation models offer a practical advantage specifically on smaller tabular datasets.
- The public leaderboard and maintenance protocols will allow new models and dataset fixes to be incorporated without resetting the benchmark.
Where Pith is reading between the lines
- Sustained maintenance of TabArena could establish a de facto standard reference for tracking long-term progress in tabular machine learning similar to established benchmarks in other domains.
- Model developers should prioritize techniques that reduce validation overfitting so their contributions remain effective when combined with other models.
- The observed performance patterns suggest that future work may benefit from exploring adaptive ensembles that weight tree-based, deep, and foundation models according to dataset size and characteristics.
- If new data distributions emerge over time, the living nature of the benchmark will make it possible to detect and quantify shifts in which modeling approaches remain effective.
Load-bearing premise
The manually selected collection of datasets accurately represents the practical tabular problems that arise in real applications.
What would settle it
A follow-up study on an independent collection of real-world tabular datasets that produces different performance rankings, particularly reversing the observed benefits of cross-model ensembling or the relative standing of deep learning versus gradient-boosted trees, would falsify the generalizability of TabArena's initial findings.
Figures
read the original abstract
With the growing popularity of deep learning and foundation models for tabular data, the need for standardized and reliable benchmarks is higher than ever. However, current benchmarks are static. Their design is not updated even if flaws are discovered, model versions are updated, or new models are released. To address this, we introduce TabArena, the first continuously maintained living tabular benchmarking system. To launch TabArena, we manually curate a representative collection of datasets and well-implemented models, conduct a large-scale benchmarking study to initialize a public leaderboard, and assemble a team of experienced maintainers. Our results highlight the influence of validation method and ensembling of hyperparameter configurations to benchmark models at their full potential. While gradient-boosted trees are still strong contenders on practical tabular datasets, we observe that deep learning methods have caught up under larger time budgets with ensembling. At the same time, foundation models excel on smaller datasets. Finally, we show that ensembles across models advance the state-of-the-art in tabular machine learning. We observe that some deep learning models are overrepresented in cross-model ensembles due to validation set overfitting, and we encourage model developers to address this issue. We launch TabArena with a public leaderboard, reproducible code, and maintenance protocols to create a living benchmark available at https://tabarena.ai.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TabArena as the first continuously maintained living benchmark for tabular machine learning. It describes the manual curation of a representative collection of datasets and well-implemented models, a large-scale benchmarking study that initializes a public leaderboard, and the assembly of a maintenance team. Key empirical observations include the continued competitiveness of gradient-boosted trees on practical datasets, the ability of deep learning methods to match or exceed them under larger time budgets with ensembling, the strength of foundation models on smaller datasets, and the finding that cross-model ensembles advance the state-of-the-art while some deep learning models appear overrepresented in such ensembles due to validation-set overfitting.
Significance. A successfully maintained living benchmark would address a genuine gap in tabular ML evaluation, where static benchmarks rapidly obsolesce. The reported findings on ensembling practices and validation overfitting could usefully guide model developers if the underlying dataset collection is shown to be representative; the provision of reproducible code and maintenance protocols is a concrete strength that supports ongoing community use.
major comments (2)
- [Dataset curation section] Dataset curation section: the claim that the manually curated collection is 'representative' of practical tabular problems is not supported by any quantitative comparison (e.g., Kolmogorov-Smirnov or Earth-mover distance on instance count, feature dimensionality, class imbalance, or domain coverage) against reference corpora such as OpenML or Kaggle. This assumption is load-bearing for the generalization of the ensemble SOTA advancement and the overfitting observations.
- [Benchmarking study and results sections] Benchmarking study and results sections: the abstract and text reference the influence of validation method and ensembling of hyperparameter configurations, yet the manuscript provides insufficient detail on exact dataset selection criteria, validation protocols, and statistical significance testing for the performance claims (e.g., no reported p-values or confidence intervals for cross-model ensemble gains).
minor comments (2)
- [Abstract and maintenance protocols] The abstract states that 'we assemble a team of experienced maintainers' but the main text does not elaborate on their specific roles or expertise, which would strengthen the claim of long-term sustainability.
- [Leaderboard presentation] Leaderboard tables or figures should explicitly report the number of runs, random seeds, and any multiple-testing corrections to allow readers to assess the reliability of the reported rankings.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript introducing TabArena. The comments highlight opportunities to strengthen the justification for our dataset collection and to improve transparency in the benchmarking methodology. We address each major comment below and will incorporate revisions to enhance the paper accordingly.
read point-by-point responses
-
Referee: [Dataset curation section] Dataset curation section: the claim that the manually curated collection is 'representative' of practical tabular problems is not supported by any quantitative comparison (e.g., Kolmogorov-Smirnov or Earth-mover distance on instance count, feature dimensionality, class imbalance, or domain coverage) against reference corpora such as OpenML or Kaggle. This assumption is load-bearing for the generalization of the ensemble SOTA advancement and the overfitting observations.
Authors: We agree that the current manuscript does not include quantitative statistical comparisons to support the representativeness claim. Our curation process relied on expert-driven selection to achieve diversity across dataset sizes, feature dimensionalities, domains, and imbalance levels, informed by practical tabular ML use cases. To address this point directly, the revised manuscript will add a dedicated analysis (likely in an appendix or new subsection) that compares key dataset statistics from our collection against reference corpora such as OpenML and Kaggle, using metrics including Kolmogorov-Smirnov tests and Earth Mover's Distance on instance counts, feature counts, and class imbalance ratios. This addition will provide quantitative grounding for the generalization of our findings on cross-model ensembles and validation overfitting. revision: yes
-
Referee: [Benchmarking study and results sections] Benchmarking study and results sections: the abstract and text reference the influence of validation method and ensembling of hyperparameter configurations, yet the manuscript provides insufficient detail on exact dataset selection criteria, validation protocols, and statistical significance testing for the performance claims (e.g., no reported p-values or confidence intervals for cross-model ensemble gains).
Authors: We concur that additional methodological detail is warranted for reproducibility and to support the performance claims. The revised manuscript will expand the relevant sections to explicitly state the dataset selection criteria (including any quantitative thresholds for size, quality, and diversity), provide precise descriptions of the validation protocols (such as the specific train/validation/test splitting strategies and hyperparameter ensembling procedures), and include statistical significance testing. In particular, we will report p-values and confidence intervals for the gains achieved by cross-model ensembles relative to individual models. These clarifications will be added without altering the core empirical results. revision: yes
Circularity Check
No significant circularity; empirical results on external datasets
full rationale
The paper reports outcomes from large-scale empirical runs of models on a manually curated collection of tabular datasets. All central claims—ensemble gains advancing SOTA, overrepresentation of certain deep learning models due to validation-set overfitting, and relative strengths of gradient-boosted trees versus deep learning under different budgets—are direct observations from these external benchmark executions rather than quantities derived from the paper's own equations or self-referential definitions. No load-bearing step reduces by construction to fitted parameters, self-citations, or ansatzes introduced in prior work by the same authors. The manual curation and living-benchmark protocols are descriptive and do not enter the performance measurements as circular inputs. This is a standard empirical benchmarking study whose results remain falsifiable against independent dataset collections.
Axiom & Free-Parameter Ledger
free parameters (2)
- Dataset curation choices
- Model implementation selections
axioms (1)
- domain assumption The curated datasets represent practical tabular data problems.
invented entities (1)
-
TabArena
no independent evidence
Forward citations
Cited by 19 Pith papers
-
MacrOData: New Benchmarks of Thousands of Datasets for Tabular Outlier Detection
MacrOData supplies three large, curated benchmark suites totaling 2,446 datasets for tabular outlier detection, complete with standardized splits, metadata, and a public leaderboard.
-
MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image
MulTaBench is a new collection of 40 image-tabular and text-tabular datasets designed to test target-aware representation tuning in multimodal tabular models.
-
TFM-Retouche: A Lightweight Input-Space Adapter for Tabular Foundation Models
TFM-Retouche is an architecture-agnostic input-space residual adapter that improves tabular foundation model accuracy on 51 datasets by learning input corrections through the frozen backbone, with an identity guard to...
-
Agentic-imodels: Evolving agentic interpretability tools via autoresearch
Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.
-
RamanBench: A Large-Scale Benchmark for Machine Learning on Raman Spectroscopy
RamanBench unifies 74 datasets into the first large-scale reproducible benchmark for ML on Raman spectra, finding tabular foundation models outperform baselines but no method generalizes across datasets.
-
Selecting Feature Interactions for Generalized Additive Models by Distilling Foundation Models
TabDistill distills feature interactions from tabular foundation models via post-hoc attribution and inserts them into GAMs, yielding consistent predictive gains.
-
OmniTabBench: Mapping the Empirical Frontiers of GBDTs, Neural Networks, and Foundation Models for Tabular Data at Scale
OmniTabBench shows no single model family dominates tabular tasks and maps performance advantages to specific dataset properties like size and skewness.
-
TS-Arena -- A Live Forecast Pre-Registration Platform
TS-Arena is a live pre-registration platform that evaluates time series forecasts on future data streams to eliminate information leakage.
-
TabPFN-3: Technical Report
TabPFN-3 delivers state-of-the-art tabular prediction performance on benchmarks up to 1M rows, is up to 20x faster than prior versions, and introduces test-time scaling that beats non-TabPFN models by hundreds of Elo points.
-
BoostLLM: Boosting-inspired LLM Fine-tuning for Few-shot Tabular Classification
BoostLLM trains sequential PEFT adapters as weak learners in a residual process, using decision-tree paths as a second input view, to improve few-shot tabular classification over standard LLM fine-tuning and match or ...
-
BoostLLM: Boosting-inspired LLM Fine-tuning for Few-shot Tabular Classification
BoostLLM trains sequential PEFT adapters in a boosting framework with tree path inputs to improve LLM performance on few-shot tabular classification, matching or exceeding XGBoost.
-
TFM-Retouche: A Lightweight Input-Space Adapter for Tabular Foundation Models
TFM-Retouche is an input-space residual adapter that lifts TabICLv2 performance by 56 Elo points on 51 tabular datasets while remaining architecture-agnostic and computationally light.
-
Tabular foundation models for in-context prediction of molecular properties
Tabular foundation models achieve high accuracy in molecular property prediction through in-context learning, with up to 100% win rates on MoleculeACE tasks when paired with CheMeleon embeddings.
-
Benchmarking Optimizers for MLPs in Tabular Deep Learning
Muon optimizer outperforms AdamW across 17 tabular datasets when training MLPs under a shared protocol.
-
TabPFN-2.5: Advancing the State of the Art in Tabular Foundation Models
TabPFN-2.5 scales tabular foundation models to 20x larger datasets, outperforms tuned tree models on TabArena, achieves near-perfect win rates against default XGBoost, and adds a distillation engine for fast productio...
-
xRFM: Accurate, scalable, and interpretable feature learning models for tabular data
xRFM merges kernel-based feature learning with tree structures for scalable, interpretable tabular modeling and reports top performance on 100 regression and competitive results on 200 classification datasets versus 3...
-
TabCF: Distributional Control Function Estimation with Tabular Foundation Models
TabCF is a tuning-light method using tabular foundation models for control function regression to estimate distributional causal effects such as interventional means and quantiles.
-
Heterogeneous Scientific Foundation Model Collaboration
Eywa enables language-based agentic AI systems to collaborate with specialized scientific foundation models for improved performance on structured data tasks.
-
Noise Immunity in In-Context Tabular Learning: An Empirical Robustness Analysis of TabPFN's Attention Mechanisms
TabPFN maintains high ROC-AUC and structured attention under controlled additions of irrelevant features, nonlinear correlations, and mislabeled targets in binary classification.
Reference graph
Works this paper leans on
-
[1]
R. Caruana, A. Niculescu-Mizil, G. Crew, and A. Ksikes. Ensemble selection from libraries of models. In R. Greiner, editor,Proceedings of the 21st International Conference on Machine Learning (ICML’04). Omnipress, 2004
work page 2004
-
[2]
V . Borisov, T. Leemann, K. Seßler, J. Haug, M. Pawelczyk, and G. Kasneci. Deep neural networks and tabular data: A survey.IEEE Transactions on Neural Networks and Learning Systems, 2022
work page 2022
-
[3]
B. van Breugel and M. van der Schaar. Why tabular foundation models should be a research priority. In R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenkamp, editors,Proceedings of the 41st International Conference on Machine Learning (ICML’24), volume 251 ofProceedings of Machine Learning Research. PMLR, 2024
work page 2024
-
[4]
M. Herrmann, F. Lange, K. Eggensperger, G. Casalicchio, M. Wever, M. Feurer, D. Rügamer, E. Hüllermeier, A.-L. Boulesteix, and B. Bischl. Position: Why we must rethink empirical research in machine learning. In R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, 11 J. Scarlett, and F. Berkenkamp, editors,Proceedings of the 41st International Con...
work page 2024
-
[5]
N. Hollmann, S. Müller, L. Purucker, A. Krishnakumar, M. Körfer, Shi Bin Hoo, Robin Tibor Schirrmeister, and Frank Hutter. Accurate predictions on small data with a tabular foundation model.Nature, 637(8045):319–326, 2025
work page 2025
-
[6]
Jun-Peng Jiang, Si-Yang Liu, Hao-Run Cai, Qile Zhou, and Han-Jia Ye. Representation learning for tabular data: A comprehensive survey.arXiv preprint arXiv:2504.16109, 2025
-
[7]
R. Kohli, M. Feurer, B. Bischl, K. Eggensperger, and F. Hutter. Towards quantifying the effect of datasets for benchmarking: A look at tabular machine learning. InData-centric Machine Learning Research Workshop at the International Conference on Learning Representations, 2024
work page 2024
-
[8]
A data-centric perspective on evaluating machine learning models for tabular data
Andrej Tschalzev, Sascha Marton, Stefan Lüdtke, Christian Bartelt, and Heiner Stuckenschmidt. A data-centric perspective on evaluating machine learning models for tabular data. InThe Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024
work page 2024
-
[9]
Yury Gorishniy, Akim Kotelnikov, and Artem Babenko. Tabm: Advancing tabular deep learning with parameter-efficient ensembling.arXiv preprint arXiv:2410.24210, 2024
-
[10]
Ivan Rubachev, Nikolay Kartashev, Yury Gorishniy, and Artem Babenko. Tabred: Ana- lyzing pitfalls and filling the gaps in tabular deep learning benchmarks.arXiv preprint arXiv:2406.19380, 2024
-
[11]
Unreflected use of tabular data repositories can undermine research quality
Andrej Tschalzev, Lennart Purucker, Stefan Lüdtke, Frank Hutter, Christian Bartelt, and Heiner Stuckenschmidt. Unreflected use of tabular data repositories can undermine research quality. InThe Future of Machine Learning Data Practices and Repositories at ICLR 2025, 2025
work page 2025
-
[12]
L. Breiman. Random forests. 45:5–32, 2001
work page 2001
- [13]
-
[14]
T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In B. Krishnapuram, M. Shah, A. Smola, C. Aggarwal, D. Shen, and R. Rastogi, editors,Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’16), pages 785–794, 2016
work page 2016
-
[15]
G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y . Liu. Lightgbm: A highly efficient gradient boosting decision tree. In I. Guyon, U. von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors,Proceedings of the 31st International Conference on Advances in Neural Information Processing Systems (NeurIPS’17), 2017
work page 2017
-
[16]
L. Prokhorenkova, G. Gusev, A. V orobev, A. Dorogush, and A. Gulin. Catboost: Unbiased boosting with categorical features. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors,Proceedings of the 31st International Conference on Advances in Neural Information Processing Systems (NeurIPS’18), page 6639–6649, 2018
work page 2018
-
[17]
Accurate intelligible models with pairwise interactions
Yin Lou, Rich Caruana, Johannes Gehrke, and Giles Hooker. Accurate intelligible models with pairwise interactions. InProceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 623–631, 2013
work page 2013
-
[18]
Harsha Nori, Samuel Jenkins, Paul Koch, and Rich Caruana. Interpretml: A unified framework for machine learning interpretability.arXiv preprint arXiv:1909.09223, 2019
-
[19]
AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data
N. Erickson, J. Mueller, A. Shirkov, H. Zhang, P. Larroy, M. Li, and A. Smola. Autogluon- tabular: Robust and accurate automl for structured data.arXiv:2003.06505 [stat.ML], 2020
work page internal anchor Pith review Pith/arXiv arXiv 2003
-
[20]
David Holzmüller, Léo Grinsztajn, and Ingo Steinwart. Better by default: Strong pre-tuned mlps and boosted trees on tabular data.Advances in Neural Information Processing Systems, 37:26577–26658, 2024
work page 2024
-
[21]
Han-Jia Ye, Huai-Hong Yin, and De-Chuan Zhan. Modern neighborhood components analysis: A deep tabular baseline two decades later.arXiv preprint arXiv:2407.03257, 2024
-
[22]
Jingang Qu, David Holzmüller, Gaël Varoquaux, and Marine Le Morvan. Tabicl: A tabular foundation model for in-context learning on large data.arXiv preprint arXiv:2502.05564, 2025. 12
-
[23]
Tabdpt: Scaling tabular foundation models on real data.arXiv preprint arXiv:2410.18164, 2024
Junwei Ma, Valentin Thomas, Rasa Hosseinzadeh, Hamidreza Kamkari, Alex Labach, Jesse C Cresswell, Keyvan Golestan, Guangwei Yu, Maksims V olkovs, and Anthony L Caterini. Tabdpt: Scaling tabular foundation models.arXiv preprint arXiv:2410.18164, 2024
-
[24]
F. Pedregosa, G. Varoquaux, A. Gramfort, V . Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V . Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. 12:2825–2830, 2011
work page 2011
-
[25]
Bojan Tunguz, Dieter, Heads or Tails, Karnika Kapoor, Parul Pandey, Paul Mooney, Phil Culliton, Rob Mulla, Sanyam Bhutani, and Will Cukierski. 2023 kaggle ai report, 2023. URL https://kaggle.com/competitions/2023-kaggle-ai-report
work page 2023
-
[26]
Carte: Pretraining and transfer for tabular learning
Myung Jun Kim, Leo Grinsztajn, and Gael Varoquaux. Carte: Pretraining and transfer for tabular learning. InInternational Conference on Machine Learning, pages 23843–23866. PMLR, 2024
work page 2024
-
[27]
Neural network ensembles, cross valida- tion, and active learning
Anders Krogh and Jesper Vedelsby. Neural network ensembles, cross valida- tion, and active learning. In G. Tesauro, D. Touretzky, and T. Leen, edi- tors,Advances in Neural Information Processing Systems, volume 7. MIT Press,
-
[28]
URL https://proceedings.neurips.cc/paper_files/paper/1994/file/ b8c37e33defde51cf91e1e03e51657da-Paper.pdf
work page 1994
- [29]
- [30]
-
[31]
Y . Gorishniy, I. Rubachev, V . Khrulkov, and A. Babenko. Revisiting deep learning models for tabular data. In M. Ranzato, A. Beygelzimer, K. Nguyen, P. Liang, J. Vaughan, and Y . Dauphin, editors,Proceedings of the 34th International Conference on Advances in Neural Information Processing Systems (NeurIPS’21), pages 18932–18943, 2021
work page 2021
-
[32]
Tabular data: Deep learning is not all you need
Ravid Shwartz-Ziv and Amitai Armon. Tabular data: Deep learning is not all you need. Information Fusion, 81:84–90, 2022
work page 2022
-
[33]
L. Grinsztajn, E. Oyallon, and G. Varoquaux. Why do tree-based models still outperform deep learning on typical tabular data? In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors,Proceedings of the 35th International Conference on Advances in Neural Information Processing Systems (NeurIPS’22), 2022
work page 2022
-
[34]
D. McElfresh, S. Khandagale, J. Valverde, V . Prasad C, G. Ramakrishnan, M. Goldblum, and C. White. When do neural nets outperform boosted trees on tabular data? In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Proceedings of the 36th International Conference on Advances in Neural Information Processing Systems (NeurIPS’23),...
work page 2023
-
[35]
S. Fischer, L. Harutyunyan, M. Feurer, and B. Bischl. Openml-ctr23 – a curated tabular regression benchmarking suite. In A. Faust, C. White, F. Hutter, R. Garnett, and J. Gardner, editors,Second International Conference on Automated Machine Learning - Workshop Track, 2023
work page 2023
-
[36]
P. Gijsbers, M. Bueno, S. Coors, E. LeDell, S. Poirier, J. Thomas, B. Bischl, and J. Vanschoren. Amlb: an automl benchmark. 25(101):1–65, 2024
work page 2024
-
[37]
A closer look at deep learning on tabular data.arXiv preprint arXiv:2407.00956, 2024
Han-Jia Ye, Si-Yang Liu, Hao-Run Cai, Qi-Le Zhou, and De-Chuan Zhan. A closer look at deep learning on tabular data.arXiv preprint arXiv:2407.00956, 2024
-
[38]
Tabrepo: A large scale repository of tabular model evalua- tions and its automl applications
David Salinas and Nick Erickson. Tabrepo: A large scale repository of tabular model evalua- tions and its automl applications. InAutoML Conference 2024 (ABCD Track), 2024
work page 2024
-
[39]
The proposed uscf rating system, its development, theory, and applications
Arpad E Elo. The proposed uscf rating system, its development, theory, and applications. Chess life, 22(8):242–247, 1967
work page 1967
-
[40]
Chatbot 13 arena: An open platform for evaluating llms by human preference
Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael Jordan, Joseph E Gonzalez, et al. Chatbot 13 arena: An open platform for evaluating llms by human preference. InForty-first International Conference on Machine Learning, 2024
work page 2024
-
[41]
J. Maier, F. Möller, and L. Purucker. Hardware aware ensemble selection for balancing predictive accuracy and cost. In M. Lindauer, K. Eggensperger, R. Garnett, J. Vanschoren, and J. Gardner, editors,Third International Conference on Automated Machine Learning - Workshop Track, 2024
work page 2024
- [42]
-
[43]
Overtuning in hyperparameter opti- mization
Lennart Schneider, Bernd Bischl, and Matthias Feurer. Overtuning in hyperparameter opti- mization. InInternational Conference on Automated Machine Learning, 2025
work page 2025
-
[44]
L. Purucker, L. Schneider, M. Anastacio, J. Beel, B. Bischl, and H. Hoos. Q(d)o-es: Population- based quality (diversity) optimisation for post hoc ensemble selection in automl. In A. Faust, C. White, F. Hutter, R. Garnett, and J. Gardner, editors,Proceedings of the Second International Conference on Automated Machine Learning. Proceedings of Machine Lear...
work page 2023
-
[45]
L. Purucker and J. Beel. Cma-es for post hoc ensembling in automl: A great success and salvageable failure. In A. Faust, C. White, F. Hutter, R. Garnett, and J. Gardner, editors, Proceedings of the Second International Conference on Automated Machine Learning. Pro- ceedings of Machine Learning Research, 2023
work page 2023
-
[46]
Manu Joseph. Pytorch tabular: A framework for deep learning with tabular data.arXiv preprint arXiv:2104.13638, 2021
- [47]
-
[48]
Joseph D Romano, Trang T Le, William La Cava, John T Gregg, Daniel J Goldberg, Praneel Chakraborty, Natasha L Ray, Daniel Himmelstein, Weixuan Fu, and Jason H Moore. Pmlb v1.0: an open-source dataset collection for benchmarking machine learning methods.Bioinformatics, 38(3):878–880, 2022
work page 2022
-
[49]
Pmlbmini: A tabular classification bench- mark suite for data-scarce applications
Ricardo Knauer, Marvin Grimm, and Erik Rodner. Pmlbmini: A tabular classification bench- mark suite for data-scarce applications. InAutoML Conference 2024 (ABCD Track), 2024
work page 2024
-
[50]
Guri Zabërgja, Arlind Kadra, Christian Frey, and Josif Grabocka. Is deep learning finally better than decision trees on tabular data?arXiv preprint arXiv:2402.03970, 2024
-
[51]
Talent: A tabular analytics and learning toolbox.arXiv preprint arXiv:2407.04057, 2024
Si-Yang Liu, Hao-Run Cai, Qi-Le Zhou, and Han-Jia Ye. Talent: A tabular analytics and learning toolbox.arXiv preprint arXiv:2407.04057, 2024
-
[52]
Tabr: Tabular deep learning meets nearest neighbors
Yury Gorishniy, Ivan Rubachev, Nikolay Kartashev, Daniil Shlenskii, Akim Kotelnikov, and Artem Babenko. Tabr: Tabular deep learning meets nearest neighbors. InThe Twelfth International Conference on Learning Representations, 2024
work page 2024
-
[53]
Rewardbench: Evaluating reward models for language modeling
Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, et al. Rewardbench: Evaluating reward models for language modeling.arXiv preprint arXiv:2403.13787, 2024
-
[54]
LiveBench: A Challenging, Contamination-Limited LLM Benchmark
Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Siddartha Naidu, et al. Livebench: A challenging, contamination-free llm benchmark.arXiv preprint arXiv:2406.19314, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[55]
A framework for few-shot language model evaluation, September 2021
Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, September 2021. URLhttps://doi.org/10.5281/zenodo.5371628
-
[56]
Clémentine Fourrier, Nathan Habib, Alina Lozovskaya, Konrad Szafer, and Thomas Wolf. Open llm leaderboard v2. https://huggingface.co/spaces/open-llm-leaderboard/ open_llm_leaderboard, 2024
work page 2024
-
[57]
Taha Aksu, Gerald Woo, Juncheng Liu, Xu Liu, Chenghao Liu, Silvio Savarese, Caiming Xiong, and Doyen Sahoo. Gift-eval: A benchmark for general time series forecasting model evaluation.arXiv preprint arXiv:2410.10393, 2024. 14
-
[58]
X. Bouthillier, P. Delaunay, M. Bronzi, A. Trofimov, B. Nichyporuk, J. Szeto, N. Mohammadi Sepahvand, E. Raff, K. Madan, V . V oleti, S. Ebrahimi Kahou, V . Michalski, T. Arbel, C. Pal, G. Varoquaux, and P. Vincent. Accounting for variance in machine learning benchmarks. In A. Smola, A. Dimakis, and I. Stoica, editors,Proceedings of Machine Learning and S...
work page 2021
-
[59]
J. Demšar. Statistical comparisons of classifiers over multiple data sets. 7:1–30, 2006
work page 2006
-
[60]
S. Herbold. Autorank: A python package for automated ranking of classifiers.Journal of Open Source Software, 5(48):2173–2173, 2020
work page 2020
-
[61]
Han-Jia Ye, Si-Yang Liu, and Wei-Lun Chao. A closer look at tabpfn v2: Strength, limitation, and extension.arXiv preprint arXiv:2502.17361, 2025
- [62]
-
[63]
Openml: Insights from 10 years and more than a thousand papers.Patterns, 2025
Bernd Bischl, Giuseppe Casalicchio, Taniya Das, Matthias Feurer, Sebastian Fischer, Pieter Gijsbers, Subhaditya Mukherjee, Andreas C Müller, László Németh, Luis Oala, et al. Openml: Insights from 10 years and more than a thousand papers.Patterns, 2025
work page 2025
-
[64]
Turl: Table understanding through representation learning.ACM SIGMOD Record, 51(1):33–40, 2022
Xiang Deng, Huan Sun, Alyssa Lees, You Wu, and Cong Yu. Turl: Table understanding through representation learning.ACM SIGMOD Record, 51(1):33–40, 2022
work page 2022
-
[65]
Madelon Hulsebos, Çagatay Demiralp, and Paul Groth. Gittables: A large-scale corpus of relational tables.Proceedings of the ACM on Management of Data, 1(1):1–17, 2023
work page 2023
-
[66]
I-Cheng Yeh, King-Jang Yang, and Tao-Ming Ting. Knowledge discovery on rfm model using bernoulli sequence.Expert Systems with applications, 36(3):5866–5871, 2009
work page 2009
-
[67]
Using the adap learning algorithm to forecast the onset of diabetes mellitus
Jack W Smith, James E Everhart, William C Dickson, William C Knowler, and Robert Scott Johannes. Using the adap learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the annual symposium on computer application in medical care, page 261, 1988
work page 1988
-
[68]
Unknown. Annealing. https://doi.org/10.24432/C5RW2F, 1990. UCI Machine Learning Repository
-
[69]
Matteo Cassotti, Davide Ballabio, Roberto Todeschini, and Viviana Consonni. A similarity- based qsar model for predicting acute toxicity towards the fathead minnow (pimephales promelas).SAR and QSAR in Environmental Research, 26(3):217–243, 2015
work page 2015
-
[70]
H. Hofmann. Statlog (german credit data) [dataset]. https://doi.org/10.24432/C5NC77,
-
[72]
Marzia Ahmed, Mohammod Abul Kashem, Mostafijur Rahman, and Sabira Khatun. Review and analysis of risk factor of maternal health in remote area using the internet of things (iot). InInECCE2019: Proceedings of the 5th International Conference on Electrical, Control & Computer Engineering, Kuantan, Pahang, Malaysia, 29th July 2019, pages 357–365. Springer, 2020
work page 2019
-
[73]
Modeling of strength of high-performance concrete using artificial neural networks
I-C Yeh. Modeling of strength of high-performance concrete using artificial neural networks. Cement and Concrete research, 28(12):1797–1808, 1998
work page 1998
-
[74]
Quantitative structure–activity relationship models for ready biodegradability of chemicals
Kamel Mansouri, Tine Ringsted, Davide Ballabio, Roberto Todeschini, and Viviana Consonni. Quantitative structure–activity relationship models for ready biodegradability of chemicals. Journal of chemical information and modeling, 53(4):867–878, 2013
work page 2013
-
[75]
Kaggle User Arunjangir245. Healthcare insurance expenses. https://www.kaggle.com/ datasets/arunjangir245/healthcare-insurance-expenses/, 2023. Kaggle dataset
work page 2023
-
[76]
Neda Abdelhamid, Aladdin Ayesh, and Fadi Thabtah. Phishing detection based associative classification data mining.Expert Systems with Applications, 41(13):5948–5959, 2014
work page 2014
-
[77]
Fitness club dataset for ml classification
Kaggle User Ddosad. Fitness club dataset for ml classification. https://www.kaggle.com/ datasets/ddosad/datacamps-data-science-associate-certification , 2023. Kaggle dataset
work page 2023
-
[78]
Airfoil self-noise and prediction
Thomas F Brooks, D Stuart Pope, and Michael A Marcolini. Airfoil self-noise and prediction. Technical report, 1989. 15
work page 1989
-
[79]
Another dataset on used fiat 500 (1538 rows).https://www.kaggle
Kaggle User Paolocons. Another dataset on used fiat 500 (1538 rows).https://www.kaggle. com/datasets/paolocons/another-fiat-500-dataset-1538-rows , 2020. Kaggle dataset
work page 2020
-
[80]
Sergey E Golovenkin, Jonathan Bac, Alexander Chervov, Evgeny M Mirkes, Yuliya V Orlova, Emmanuel Barillot, Alexander N Gorban, and Andrei Zinovyev. Trajectories, bifurcations, and pseudo-time in large clinical datasets: applications to myocardial infarction and diabetes data.GigaScience, 9(11):giaa128, 2020
work page 2020
-
[81]
Is this a good customer? https://www.kaggle.com/datasets/ podsyp/is-this-a-good-customer, 2020
Kaggle User Podsyp. Is this a good customer? https://www.kaggle.com/datasets/ podsyp/is-this-a-good-customer, 2020. Kaggle dataset
work page 2020
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.