pith. machine review for the scientific record.

arxiv: 2207.01848 · v6 · submitted 2022-07-05 · 💻 cs.LG · stat.ML

Recognition: 2 Lean theorem links

TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 02:58 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords tabular classification · transformer · in-context learning · prior-data fitted network · Bayesian inference · structural causal models · small datasets · AutoML

The pith

A pre-trained Transformer performs competitive classification on small tabular datasets in under a second with no tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TabPFN, a Transformer pre-trained offline to approximate Bayesian inference for supervised classification on small tabular datasets. It accepts training and test samples together as input and produces predictions for the full test set in one forward pass by performing in-context learning from the labeled examples provided. The training uses synthetic datasets drawn from a prior over structural causal models that favors simple structures. A sympathetic reader would care because the approach promises to deliver results competitive with boosted trees and AutoML systems while eliminating hyperparameter search and long training times for common small-data tasks.
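
To make that workflow concrete, here is a minimal usage sketch assuming the TabPFNClassifier wrapper from the repository linked in the abstract (github.com/automl/TabPFN); constructor arguments and defaults may differ between released versions.

    # Minimal usage sketch of the single-forward-pass workflow described above.
    # TabPFNClassifier is assumed from github.com/automl/TabPFN; details may vary by version.
    from sklearn.datasets import load_breast_cancer
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from tabpfn import TabPFNClassifier

    X, y = load_breast_cancer(return_X_y=True)          # small, purely numerical, 2 classes
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = TabPFNClassifier()                            # pre-trained weights, no tuning
    clf.fit(X_train, y_train)                           # stores the context; no gradient steps
    proba = clf.predict_proba(X_test)                   # one forward pass over train + test
    print("ROC AUC:", roc_auc_score(y_test, proba[:, 1]))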

Core claim

TabPFN is a Prior-Data Fitted Network: a Transformer trained once to approximate Bayesian inference on synthetic datasets drawn from a prior over structural causal models with a preference for simple structures. It performs in-context learning on sequences of labeled examples, with the learned inference procedure entailed entirely in the network weights; it accepts training and test samples as a set-valued input and yields predictions for the entire test set in a single forward pass. On the 18 datasets in the OpenML-CC18 suite that contain up to 1 000 training data points, up to 100 purely numerical features without missing values, and up to 10 classes, TabPFN clearly outperforms boosted trees and performs on par with complex state-of-the-art AutoML systems, with up to a 230× speedup (5 700× on a GPU).

What carries the argument

Prior-Data Fitted Network (PFN) trained to approximate Bayesian inference on synthetic data from a structural causal model prior, enabling in-context learning for tabular classification.
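
A schematic of that machinery, hedged as pseudocode rather than the released implementation: sample one synthetic task from the SCM prior per step, reveal labels only for a context prefix, and train the Transformer with cross-entropy on the held-out targets. `model` and `sample_task_from_scm_prior` are hypothetical stand-ins for the paper's architecture and prior sampler.

    # Schematic PFN prior-fitting: cross-entropy on held-out targets of synthetic
    # tasks drawn from the SCM prior. `model` and `sample_task_from_scm_prior`
    # are hypothetical stand-ins, not the released code.
    import torch

    def prior_fit(model, sample_task_from_scm_prior, steps=1000, lr=1e-4):
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = torch.nn.CrossEntropyLoss()
        for _ in range(steps):
            X, y, n_ctx = sample_task_from_scm_prior()  # one synthetic dataset
            logits = model(X, y[:n_ctx])                # labels visible for context only
            loss = loss_fn(logits[n_ctx:], y[n_ctx:])   # predict the query targets
            opt.zero_grad()
            loss.backward()
            opt.step()
        return model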

Load-bearing premise

The prior over structural causal models used to generate the synthetic training data is sufficiently representative of the distribution of real-world small tabular classification problems.

What would settle it

Evaluating TabPFN on the 18 OpenML-CC18 datasets that meet the paper's filter (up to 1 000 training points, up to 100 purely numerical features without missing values, and up to 10 classes) and finding that it neither clearly outperforms boosted trees nor matches AutoML performance would falsify the central claim.
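
A sketch of how that filter can be reproduced with the openml Python package; the quality names are standard OpenML metadata, while the exact cutoffs (2 000 instances for 1 000 training points after splitting, 101 features including the target, one symbolic attribute for the target) are assumptions about how the criteria map onto dataset totals.

    # Filter OpenML-CC18 down to the small, purely numerical tasks described above.
    # Quality names are standard OpenML metadata; the exact cutoffs that reproduce
    # the paper's 18 datasets are an assumption here, not taken from its code.
    import openml

    suite = openml.study.get_suite(99)  # the OpenML-CC18 benchmark suite
    kept = []
    for task_id in suite.tasks:
        q = openml.tasks.get_task(task_id).get_dataset().qualities
        if (q["NumberOfInstances"] <= 2000            # <= 1000 training points after split
                and q["NumberOfFeatures"] <= 101      # <= 100 features plus the target
                and q["NumberOfClasses"] <= 10
                and q["NumberOfMissingValues"] == 0
                and q["NumberOfSymbolicFeatures"] <= 1):  # only the target is categorical
            kept.append(task_id)

    print(len(kept), "candidate tasks")  # the paper reports 18 under its exact filter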

read the original abstract

We present TabPFN, a trained Transformer that can do supervised classification for small tabular datasets in less than a second, needs no hyperparameter tuning and is competitive with state-of-the-art classification methods. TabPFN performs in-context learning (ICL), it learns to make predictions using sequences of labeled examples (x, f(x)) given in the input, without requiring further parameter updates. TabPFN is fully entailed in the weights of our network, which accepts training and test samples as a set-valued input and yields predictions for the entire test set in a single forward pass. TabPFN is a Prior-Data Fitted Network (PFN) and is trained offline once, to approximate Bayesian inference on synthetic datasets drawn from our prior. This prior incorporates ideas from causal reasoning: It entails a large space of structural causal models with a preference for simple structures. On the 18 datasets in the OpenML-CC18 suite that contain up to 1 000 training data points, up to 100 purely numerical features without missing values, and up to 10 classes, we show that our method clearly outperforms boosted trees and performs on par with complex state-of-the-art AutoML systems with up to 230$\times$ speedup. This increases to a 5 700$\times$ speedup when using a GPU. We also validate these results on an additional 67 small numerical datasets from OpenML. We provide all our code, the trained TabPFN, an interactive browser demo and a Colab notebook at https://github.com/automl/TabPFN.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces TabPFN, a Transformer-based Prior-Data Fitted Network (PFN) trained offline once on synthetic datasets drawn from an explicit prior over structural causal models. The model performs in-context learning for small tabular classification by accepting training and test points as set-valued input and producing predictions in a single forward pass, with no further parameter updates or hyperparameter tuning required. On a filtered subset of 18 OpenML-CC18 datasets (≤1000 points, ≤100 numerical features, no missing values, ≤10 classes) plus 67 additional small numerical OpenML datasets, the authors claim clear outperformance over gradient-boosted trees, parity with state-of-the-art AutoML systems, and speedups of up to 230× (5700× on GPU).

Significance. If the empirical claims hold, the work is significant because it demonstrates that a single pre-trained network can approximate Bayesian inference under a causal prior for tabular data, delivering AutoML-level accuracy at inference speeds that are orders of magnitude faster and without any per-dataset tuning. This could materially change practice for the large class of small tabular problems that dominate many applied domains.

major comments (3)
  1. [Results section] In the tables reporting performance on the 18 OpenML-CC18 datasets, the headline claims of outperformance over boosted trees and parity with AutoML lack error bars, standard deviations across runs, and statistical significance tests, making it impossible to judge whether the observed differences are reliable or merely reflect dataset-specific variance (a minimal sketch of such a test follows these major comments).
  2. [Method section] The construction of the structural causal model prior used to generate the synthetic training data is presented without any ablation of its components (e.g., the preference for simple structures, the choice of causal mechanisms) and without a sensitivity analysis showing how performance changes when these choices are varied, even though the prior is load-bearing for the claim that the learned ICL procedure generalizes.
  3. [Evaluation / Experiments] No section reports a direct distributional comparison (e.g., MMD, moment matching, or feature-interaction statistics) between samples drawn from the synthetic SCM prior and the 18 filtered real OpenML test sets. Without such evidence the central generalization assumption—that the prior is sufficiently representative of the target distribution—remains untested and could explain the observed transfer performance as an artifact of the particular dataset filter rather than a robust property of the method.
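
A minimal sketch of the significance test requested in major comment 1: a paired Wilcoxon signed-rank test on per-dataset ROC AUC scores. The AUC values below are synthetic placeholders, not results from the paper.

    # Paired Wilcoxon signed-rank test on per-dataset ROC AUC scores, TabPFN vs.
    # one baseline. The AUC arrays are synthetic placeholders, not paper numbers.
    import numpy as np
    from scipy.stats import wilcoxon

    rng = np.random.default_rng(0)
    auc_baseline = rng.uniform(0.80, 0.95, size=18)              # one AUC per dataset
    auc_tabpfn = np.clip(auc_baseline + rng.normal(0.01, 0.01, size=18), 0.0, 1.0)

    stat, p = wilcoxon(auc_tabpfn, auc_baseline, alternative="greater")
    print(f"Wilcoxon statistic={stat:.1f}, one-sided p={p:.4f}")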
minor comments (2)
  1. [Abstract / Experiments] The exact conditions under which the 230× and 5 700× speedups are measured (hardware, batching, baseline implementations) are stated only in the abstract and should be restated with a precise timing methodology in the main text or appendix (see the timing sketch after this list).
  2. [Figures and Tables] Figure legends and table captions would benefit from explicit statements of the number of datasets, the exact filtering criteria, and whether the reported metrics are averages or per-dataset values.
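
And for minor comment 1, a minimal sketch of the requested timing methodology: combined fit-plus-predict wall clock from a monotonic timer, reported alongside the hardware. The TabPFNClassifier interface is assumed from the linked repository, not the paper's benchmarking harness.

    # Wall-clock timing of a single fit + predict pass, with the hardware stated.
    # TabPFNClassifier is assumed from github.com/automl/TabPFN; a sketch only.
    import platform
    import time

    from sklearn.datasets import load_iris
    from tabpfn import TabPFNClassifier

    X, y = load_iris(return_X_y=True)
    clf = TabPFNClassifier()

    t0 = time.perf_counter()
    clf.fit(X[:100], y[:100])      # stores the context; no gradient updates
    clf.predict(X[100:])           # single forward pass over context + queries
    elapsed = time.perf_counter() - t0

    print(f"fit+predict: {elapsed:.3f}s on {platform.machine()} / {platform.processor()}")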

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and will incorporate revisions to improve the clarity and robustness of the manuscript.

read point-by-point responses
  1. Referee: [Results section] In the tables reporting performance on the 18 OpenML-CC18 datasets, the headline claims of outperformance over boosted trees and parity with AutoML lack error bars, standard deviations across runs, and statistical significance tests, making it impossible to judge whether the observed differences are reliable or merely reflect dataset-specific variance.

    Authors: We agree that error bars and statistical tests are important for assessing reliability. In the revised manuscript we will add standard deviations computed over multiple random seeds to the performance tables and include pairwise statistical significance tests (e.g., Wilcoxon signed-rank test with p-values) comparing TabPFN against the baselines. revision: yes

  2. Referee: [Method section] The construction of the structural causal model prior used to generate the synthetic training data is presented without any ablation of its components (e.g., the preference for simple structures, the choice of causal mechanisms) and without a sensitivity analysis showing how performance changes when these choices are varied, even though the prior is load-bearing for the claim that the learned ICL procedure generalizes.

    Authors: We acknowledge that explicit ablations would strengthen the justification of the prior design. We will add a sensitivity analysis in the appendix that varies key components such as the structural simplicity bias and causal mechanism choices, reporting their effect on downstream classification performance on the evaluation datasets. revision: yes

  3. Referee: [Evaluation / Experiments] No section reports a direct distributional comparison (e.g., MMD, moment matching, or feature-interaction statistics) between samples drawn from the synthetic SCM prior and the 18 filtered real OpenML test sets. Without such evidence the central generalization assumption—that the prior is sufficiently representative of the target distribution—remains untested and could explain the observed transfer performance as an artifact of the particular dataset filter rather than a robust property of the method.

    Authors: We agree that a direct comparison would better support the generalization claim. We will add a new subsection (or appendix) that reports distributional comparisons, including maximum mean discrepancy (MMD) and selected moment and interaction statistics, between samples from the synthetic SCM prior and the real OpenML datasets (see the sketch after these responses). revision: yes
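
A minimal sketch of the promised MMD comparison, using an unbiased RBF-kernel estimator; both samples below are synthetic stand-ins, and the kernel bandwidth is an illustrative choice.

    # Unbiased MMD^2 estimate with an RBF kernel between a prior sample and a
    # real dataset. Both inputs are synthetic stand-ins; gamma is illustrative.
    import numpy as np

    def mmd2_rbf(X, Y, gamma=1.0):
        def k(A, B):
            d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
            return np.exp(-gamma * d2)
        Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
        n, m = len(X), len(Y)
        return ((Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
                + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
                - 2.0 * Kxy.mean())

    rng = np.random.default_rng(0)
    prior_sample = rng.normal(size=(300, 10))           # stand-in for SCM prior draws
    real_sample = rng.normal(0.1, 1.1, size=(300, 10))  # stand-in for an OpenML dataset
    print(f"MMD^2 = {mmd2_rbf(prior_sample, real_sample):.4f}")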

Circularity Check

0 steps flagged

No significant circularity: training on a synthetic prior, evaluation on held-out real data

full rationale

The paper trains TabPFN offline on synthetic datasets drawn from an explicit prior over structural causal models to approximate Bayesian inference, then evaluates generalization on filtered real OpenML-CC18 datasets (and an additional 67 datasets). The headline performance claims (outperformance of boosted trees, parity with AutoML) are measured on these held-out real instances and are not forced by construction from the training inputs. No self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations that reduce the central empirical result to the authors' own prior work appear in the provided text. The method is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on an unstated but load-bearing prior over structural causal models whose exact parameterization and sampling procedure are not detailed in the abstract; no free parameters are explicitly named, but the prior itself functions as the main modeling choice.

axioms (1)
  • domain assumption A broad space of structural causal models with preference for simple structures generates synthetic data whose distribution is close enough to real tabular classification problems for the trained network to generalize.
    Invoked in the abstract when describing the prior used to train the PFN.
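
To make the axiom concrete, a hedged sketch of drawing one synthetic classification dataset from an SCM-style prior with a simplicity bias: sample a small, sparsified random network, push exogenous noise through it, and read features and a discretized target off internal nodes. All sizes and distributions here are illustrative choices, not the paper's actual prior.

    # Hedged sketch of sampling one dataset from an SCM-style prior: a small
    # random network with randomly dropped edges stands in for a simple DAG.
    import numpy as np

    def sample_scm_dataset(n=256, seed=0):
        rng = np.random.default_rng(seed)
        width = int(rng.integers(4, 16))           # small graphs encode the simplicity bias
        k = int(rng.integers(2, min(10, width)))   # number of observed features
        # Random causal mechanisms: two layers with randomly dropped edges.
        W1 = rng.normal(size=(width, 3)) * (rng.random((width, 3)) < 0.5)
        W2 = rng.normal(size=(width, width)) * (rng.random((width, width)) < 0.3)
        z = rng.normal(size=(n, 3))                # exogenous noise variables
        h = np.tanh(z @ W1.T)
        h = np.tanh(h @ W2.T + rng.normal(0.0, 0.1, size=(n, width)))
        nodes = rng.choice(width, size=k + 1, replace=False)
        X, target = h[:, nodes[:k]], h[:, nodes[k]]
        y = (target > np.median(target)).astype(int)  # discretize target into classes
        return X, y

    X, y = sample_scm_dataset()
    print(X.shape, np.bincount(y))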

pith-pipeline@v0.9.0 · 5599 in / 1426 out tokens · 40877 ms · 2026-05-15T02:58:24.217861+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Privacy Auditing with Zero (0) Training Run

    cs.CR 2026-05 unverdicted novelty 8.0

    Zero-Run auditing supplies valid lower bounds on differential privacy parameters from fixed member and non-member datasets by modeling and correcting distribution-shift confounding via causal-inference techniques.

  2. FORGE: Fragment-Oriented Ranking and Generation for Context-Aware Molecular Optimization

    cs.LG 2026-05 unverdicted novelty 7.0

    FORGE reformulates molecular optimization as context-aware fragment ranking and replacement using mined low-to-high edit pairs, outperforming larger language models and graph methods on standard benchmarks.

  3. Quantifying the Risk-Return Tradeoff in Forecasting

    econ.EM 2026-05 unverdicted novelty 7.0

    Forecast loss differentials are reframed as returns and assessed with risk-adjusted finance metrics, showing professional forecasters are harder to beat on risk-adjusted performance than on raw accuracy in US macro fo...

  4. Data Language Models: A New Foundation Model Class for Tabular Data

    cs.AI 2026-05 unverdicted novelty 7.0

    Schema-1 is the first Data Language Model that natively understands raw tabular data and outperforms gradient-boosted ensembles, AutoML, and prior tabular foundation models on row-level prediction and imputation tasks.

  5. TFM-Retouche: A Lightweight Input-Space Adapter for Tabular Foundation Models

    cs.LG 2026-05 unverdicted novelty 7.0

    TFM-Retouche is an architecture-agnostic input-space residual adapter that improves tabular foundation model accuracy on 51 datasets by learning input corrections through the frozen backbone, with an identity guard to...

  6. PHBench: A Benchmark for Predicting Startup Series A Funding from Product Hunt Launch Signals

    q-fin.PR 2026-05 unverdicted novelty 7.0

    PHBench shows Product Hunt launch signals predict Series A funding with an ensemble model reaching AP 0.037 and F0.5 0.097 on blind test data, outperforming logistic regression and zero-shot LLMs.

  7. Explainable Load Forecasting with Covariate-Informed Time Series Foundation Models

    cs.LG 2026-04 unverdicted novelty 7.0

    Time series foundation models match the performance of specialized models for day-ahead load forecasting while providing explanations that match domain knowledge on weather and calendar effects.

  8. Selecting Feature Interactions for Generalized Additive Models by Distilling Foundation Models

    cs.LG 2026-04 unverdicted novelty 7.0

    TabDistill distills feature interactions from tabular foundation models via post-hoc attribution and inserts them into GAMs, yielding consistent predictive gains.

  9. Environmental, Social and Governance Sentiment Analysis on Slovene News: A Novel Dataset and Models

    cs.CL 2026-04 unverdicted novelty 7.0

    The authors release the first Slovene ESG sentiment dataset from news and report that large language models lead on environmental and social classification while fine-tuned SloBERTa performs best on governance.

  10. Reciprocal Co-Training (RCT): Coupling Gradient-Based and Non-Differentiable Models via Reinforcement Learning

    cs.CL 2026-03 unverdicted novelty 7.0

    RCT couples an LLM and Random Forest via RL feedback so each augments the other's features and rewards, producing consistent gains on three medical datasets.

  11. TabPFN-3: Technical Report

    cs.LG 2026-05 unverdicted novelty 6.0

    TabPFN-3 delivers state-of-the-art tabular prediction performance on benchmarks up to 1M rows, is up to 20x faster than prior versions, and introduces test-time scaling that beats non-TabPFN models by hundreds of Elo points.

  12. LGB+: A Macroeconomic Forecasting Road Test

    econ.EM 2026-05 unverdicted novelty 6.0

    LGB+ improves macroeconomic forecasts by letting linear basis functions compete with or alternate against tree updates inside gradient boosting, yielding native linear/nonlinear decomposition of predictions.

  13. CarCrashNet: A Large-Scale Dataset and Hierarchical Neural Solver for Data-Driven Structural Crash Simulation

    cs.LG 2026-05 unverdicted novelty 6.0

    CarCrashNet releases a large-scale open benchmark dataset of structural crash simulations and a hierarchical neural solver for data-driven full-vehicle crash prediction.

  14. ModelLens: Finding the Best for Your Task from Myriads of Models

    cs.LG 2026-05 unverdicted novelty 6.0

    ModelLens learns a performance-aware latent space from 1.62M leaderboard records to rank unseen models on unseen datasets without forward passes on the target.

  15. Decoupled PFNs: Identifiable Epistemic-Aleatoric Decomposition via Structured Synthetic Priors

    stat.ML 2026-05 conditional novelty 6.0

    Decoupled PFNs use controllable synthetic priors to train separate latent-signal and noise heads, making epistemic-aleatoric decomposition identifiable and improving acquisition in noisy settings.

  16. TFM-Retouche: A Lightweight Input-Space Adapter for Tabular Foundation Models

    cs.LG 2026-05 unverdicted novelty 6.0

    TFM-Retouche is an input-space residual adapter that lifts TabICLv2 performance by 56 Elo points on 51 tabular datasets while remaining architecture-agnostic and computationally light.

  17. Tabular foundation models for in-context prediction of molecular properties

    cs.LG 2026-04 unverdicted novelty 6.0

    Tabular foundation models achieve high accuracy in molecular property prediction through in-context learning, with up to 100% win rates on MoleculeACE tasks when paired with CheMeleon embeddings.

  18. ReSS: Learning Reasoning Models for Tabular Data Prediction via Symbolic Scaffold

    cs.AI 2026-04 unverdicted novelty 6.0

    ReSS uses decision-tree scaffolds to fine-tune LLMs for faithful tabular reasoning, reporting up to 10% gains over baselines on medical and financial data.

  19. From Uniform to Learned Knots: A Study of Spline-Based Numerical Encodings for Tabular Deep Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    Spline encodings for numerical features show task-dependent performance in tabular deep learning, with piecewise-linear encoding robust for classification and variable results for regression depending on spline family...

  20. TabPFN-2.5: Advancing the State of the Art in Tabular Foundation Models

    cs.LG 2025-11 unverdicted novelty 6.0

    TabPFN-2.5 scales tabular foundation models to 20x larger datasets, outperforms tuned tree models on TabArena, achieves near-perfect win rates against default XGBoost, and adds a distillation engine for fast productio...

  21. Evaluating TabPFN for Mild Cognitive Impairment to Alzheimer's Disease Conversion in Data Limited Settings

    cs.AI 2026-04 unverdicted novelty 4.0

    TabPFN reaches AUC 0.892 for 3-year MCI-to-AD conversion on TADPOLE data and holds performance at N=50 training samples where XGBoost, Random Forest, LightGBM, and logistic regression degrade.

  22. Optimizing IoT Intrusion Detection with Tabular Foundation Models for Smart City Forensics

    cs.CR 2026-04 unverdicted novelty 4.0

    TabPFNv2.5 delivers 40x faster inference than Random Forest at 97% binary accuracy on TON IoT data, enabling a hybrid pipeline for real-time IoT threat screening in smart cities.

  23. Noise Immunity in In-Context Tabular Learning: An Empirical Robustness Analysis of TabPFN's Attention Mechanisms

    cs.LG 2026-04 unverdicted novelty 4.0

    TabPFN maintains high ROC-AUC and structured attention under controlled additions of irrelevant features, nonlinear correlations, and mislabeled targets in binary classification.
