Recognition: 2 theorem links
· Lean theorem
TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second
Pith reviewed 2026-05-15 02:58 UTC · model grok-4.3
The pith
A pre-trained Transformer performs competitive classification on small tabular datasets in under a second with no tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TabPFN is a Prior-Data Fitted Network, a Transformer trained once to approximate Bayesian inference on synthetic datasets drawn from a prior over structural causal models with a preference for simple structures. It performs in-context learning on sequences of labeled examples, with all learning entailed in the network weights, accepting training and test samples as set-valued input and yielding predictions for the entire test set in a single forward pass. On the 18 datasets in the OpenML-CC18 suite that contain up to 1 000 training data points, up to 100 purely numerical features without missing values, and up to 10 classes, TabPFN clearly outperforms boosted trees and performs on par with complex state-of-the-art AutoML systems, with up to a 230× speedup (5 700× on a GPU).
What carries the argument
Prior-Data Fitted Network (PFN) trained to approximate Bayesian inference on synthetic data from a structural causal model prior, enabling in-context learning for tabular classification.
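As a concrete picture of this interface (a minimal stand-in, not TabPFN's actual API or architecture): the predictor consumes the labeled training set and the unlabeled test set together and emits predictions for every test row in a single call, with no parameter updates. Here a trivial distance-weighted vote plays the role of the trained Transformer; `icl_predict` is an illustrative name.

```python
def icl_predict(X_train, y_train, X_test):
    """Single-call, set-valued interface in the PFN style: all adaptation
    happens inside one forward pass. A real PFN runs a trained Transformer
    over the whole set; this toy stand-in merely distance-weights labels."""
    classes = sorted(set(y_train))
    preds = []
    for x in X_test:
        # Weight each training label by inverse squared distance to x.
        scores = {c: 0.0 for c in classes}
        for xt, yt in zip(X_train, y_train):
            d2 = sum((a - b) ** 2 for a, b in zip(x, xt))
            scores[yt] += 1.0 / (1.0 + d2)
        preds.append(max(classes, key=lambda c: scores[c]))
    return preds

# Two well-separated toy clusters.
X_train = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]]
y_train = [0, 0, 1, 1]
X_test = [[0.2, 0.1], [4.8, 5.1]]
print(icl_predict(X_train, y_train, X_test))  # -> [0, 1]
```

The point of the sketch is the shape of the call, not the predictor inside it: training and test rows arrive together, and predictions for the whole test set come back from one invocation.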
Load-bearing premise
The prior over structural causal models used to generate the synthetic training data is sufficiently representative of the distribution of real-world small tabular classification problems.
What would settle it
Evaluating TabPFN on the 18 filtered OpenML-CC18 datasets (up to 1 000 training points, up to 100 numerical features, up to 10 classes) and finding that it fails to outperform boosted trees or to match AutoML performance would falsify the central claim.
read the original abstract
We present TabPFN, a trained Transformer that can do supervised classification for small tabular datasets in less than a second, needs no hyperparameter tuning and is competitive with state-of-the-art classification methods. TabPFN performs in-context learning (ICL), it learns to make predictions using sequences of labeled examples (x, f(x)) given in the input, without requiring further parameter updates. TabPFN is fully entailed in the weights of our network, which accepts training and test samples as a set-valued input and yields predictions for the entire test set in a single forward pass. TabPFN is a Prior-Data Fitted Network (PFN) and is trained offline once, to approximate Bayesian inference on synthetic datasets drawn from our prior. This prior incorporates ideas from causal reasoning: It entails a large space of structural causal models with a preference for simple structures. On the 18 datasets in the OpenML-CC18 suite that contain up to 1 000 training data points, up to 100 purely numerical features without missing values, and up to 10 classes, we show that our method clearly outperforms boosted trees and performs on par with complex state-of-the-art AutoML systems with up to 230× speedup. This increases to a 5 700× speedup when using a GPU. We also validate these results on an additional 67 small numerical datasets from OpenML. We provide all our code, the trained TabPFN, an interactive browser demo and a Colab notebook at https://github.com/automl/TabPFN.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TabPFN, a Transformer-based Prior-Data Fitted Network (PFN) trained offline once on synthetic datasets drawn from an explicit prior over structural causal models. The model performs in-context learning for small tabular classification by accepting training and test points as set-valued input and producing predictions in a single forward pass, with no further parameter updates or hyperparameter tuning required. On a filtered subset of 18 OpenML-CC18 datasets (≤1000 points, ≤100 numerical features, no missing values, ≤10 classes) plus 67 additional small numerical OpenML datasets, the authors claim clear outperformance over gradient-boosted trees, parity with state-of-the-art AutoML systems, and speedups of up to 230× (5700× on GPU).
Significance. If the empirical claims hold, the work is significant because it demonstrates that a single pre-trained network can approximate Bayesian inference under a causal prior for tabular data, delivering AutoML-level accuracy at inference speeds that are orders of magnitude faster and without any per-dataset tuning. This could materially change practice for the large class of small tabular problems that dominate many applied domains.
major comments (3)
- [Results section] Results section (tables reporting performance on the 18 OpenML-CC18 datasets): the headline claims of outperformance over boosted trees and parity with AutoML lack error bars, standard deviations across runs, or statistical significance tests, making it impossible to judge whether the observed differences are reliable or could be explained by dataset-specific variance.
- [Method section] Method section describing the prior (the construction of the structural causal model prior used to generate synthetic training data): the paper provides no ablation of the prior components (e.g., preference for simple structures, choice of causal mechanisms) and no sensitivity analysis showing how performance changes when these choices are varied, which is load-bearing for the claim that the learned ICL procedure generalizes.
- [Evaluation / Experiments] No section reports a direct distributional comparison (e.g., MMD, moment matching, or feature-interaction statistics) between samples drawn from the synthetic SCM prior and the 18 filtered real OpenML test sets. Without such evidence the central generalization assumption—that the prior is sufficiently representative of the target distribution—remains untested and could explain the observed transfer performance as an artifact of the particular dataset filter rather than a robust property of the method.
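The first major point is easy to make operational. A paired Wilcoxon signed-rank test over per-dataset scores is the standard instrument here; the sketch below is a self-contained normal-approximation version (no tie correction), and the AUC values are invented purely for illustration, not taken from the paper.

```python
import math

def wilcoxon_signed_rank(a, b):
    """Paired Wilcoxon signed-rank test (normal approximation, no tie
    correction): returns (W+, two-sided p) for per-dataset score pairs."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    ranked = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    # Ranks 1..n assigned by |d|; ties are broken naively here.
    w_plus = sum(r + 1 for r, i in enumerate(ranked) if diffs[i] > 0)
    n = len(diffs)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean) / sd
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return w_plus, p

# Illustrative (made-up) per-dataset ROC AUCs for two methods.
auc_tabpfn  = [0.91, 0.88, 0.95, 0.90, 0.87, 0.93, 0.89, 0.92]
auc_boosted = [0.89, 0.87, 0.93, 0.91, 0.85, 0.92, 0.88, 0.90]
w, p = wilcoxon_signed_rank(auc_tabpfn, auc_boosted)
```

With per-seed standard deviations added to the tables, a test of this form would let readers judge whether per-dataset variance explains the headline gaps.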
minor comments (2)
- [Abstract / Experiments] The exact conditions under which the 230× and 5700× speedups are measured (hardware, batching, comparison baseline implementation) are stated only in the abstract and should be repeated with precise timing methodology in the main text or appendix.
- [Figures and Tables] Figure legends and table captions would benefit from explicit statements of the number of datasets, the exact filtering criteria, and whether the reported metrics are averages or per-dataset values.
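On the minor timing point, the methodology the referee asks for can be pinned down with very little machinery: median wall-clock time over repeated fit-plus-predict runs on fixed hardware, measured with the same harness for every method. The sketch below uses stand-in workloads; nothing about it reflects the paper's actual benchmark code.

```python
import time
import statistics

def time_method(fn, repeats=5):
    """Median wall-clock seconds over several runs; perf_counter is a
    monotonic clock suited to elapsed-time measurement."""
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return statistics.median(times)

# Hypothetical stand-ins for the fit-plus-predict step of two methods.
fast = lambda: sum(i * i for i in range(10_000))
slow = lambda: sum(i * i for i in range(1_000_000))

speedup = time_method(slow) / time_method(fast)
```

Reporting the clock used, the repeat count, the aggregation (median vs. mean), and the hardware alongside each speedup figure would make the 230× and 5 700× claims reproducible.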
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major point below and will incorporate revisions to improve the clarity and robustness of the manuscript.
read point-by-point responses
-
Referee: [Results section] Results section (tables reporting performance on the 18 OpenML-CC18 datasets): the headline claims of outperformance over boosted trees and parity with AutoML lack error bars, standard deviations across runs, or statistical significance tests, making it impossible to judge whether the observed differences are reliable or could be explained by dataset-specific variance.
Authors: We agree that error bars and statistical tests are important for assessing reliability. In the revised manuscript we will add standard deviations computed over multiple random seeds to the performance tables and include pairwise statistical significance tests (e.g., Wilcoxon signed-rank test with p-values) comparing TabPFN against the baselines. revision: yes
-
Referee: [Method section] Method section describing the prior (the construction of the structural causal model prior used to generate synthetic training data): the paper provides no ablation of the prior components (e.g., preference for simple structures, choice of causal mechanisms) and no sensitivity analysis showing how performance changes when these choices are varied, which is load-bearing for the claim that the learned ICL procedure generalizes.
Authors: We acknowledge that explicit ablations would strengthen the justification of the prior design. We will add a sensitivity analysis in the appendix that varies key components such as the structural simplicity bias and causal mechanism choices, reporting their effect on downstream classification performance on the evaluation datasets. revision: yes
-
Referee: [Evaluation / Experiments] No section reports a direct distributional comparison (e.g., MMD, moment matching, or feature-interaction statistics) between samples drawn from the synthetic SCM prior and the 18 filtered real OpenML test sets. Without such evidence the central generalization assumption—that the prior is sufficiently representative of the target distribution—remains untested and could explain the observed transfer performance as an artifact of the particular dataset filter rather than a robust property of the method.
Authors: We agree that a direct comparison would better support the generalization claim. We will add a new subsection (or appendix) that reports distributional comparisons, including maximum mean discrepancy (MMD) and selected moment and interaction statistics, between samples from the synthetic SCM prior and the real OpenML datasets. revision: yes
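The MMD comparison the authors promise can be prototyped directly. Below is an unbiased estimator of squared MMD under an RBF kernel, checked on toy Gaussian samples; the bandwidth `gamma` and the sample sizes are arbitrary illustrative choices, and applying this to the SCM prior would require drawing synthetic datasets from it.

```python
import math
import random

def rbf(x, y, gamma):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mmd2_unbiased(X, Y, gamma=1.0):
    """Unbiased estimate of squared maximum mean discrepancy between
    samples X and Y under an RBF kernel with bandwidth parameter gamma."""
    m, n = len(X), len(Y)
    xx = sum(rbf(X[i], X[j], gamma) for i in range(m) for j in range(m) if i != j)
    yy = sum(rbf(Y[i], Y[j], gamma) for i in range(n) for j in range(n) if i != j)
    xy = sum(rbf(x, y, gamma) for x in X for y in Y)
    return xx / (m * (m - 1)) + yy / (n * (n - 1)) - 2.0 * xy / (m * n)

random.seed(0)
same      = [[random.gauss(0, 1)] for _ in range(50)]
also_same = [[random.gauss(0, 1)] for _ in range(50)]
shifted   = [[random.gauss(3, 1)] for _ in range(50)]
# MMD^2 is near zero for matched distributions, large for shifted ones.
```

A small MMD between prior draws and the filtered OpenML datasets would support the representativeness assumption; a large one would localize where the prior misses.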
Circularity Check
No significant circularity: training on synthetic prior, evaluation on held-out real data
full rationale
The paper trains TabPFN offline on synthetic datasets drawn from an explicit prior over structural causal models to approximate Bayesian inference, then evaluates generalization on filtered real OpenML-CC18 datasets (and an additional 67 datasets). The headline performance claims (outperformance of boosted trees, parity with AutoML) are measured on these held-out real instances and are not forced by construction from the training inputs. No self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations that reduce the central empirical result to the authors' own prior work appear in the provided text. The method is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption A broad space of structural causal models with preference for simple structures generates synthetic data whose distribution is close enough to real tabular classification problems for the trained network to generalize.
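To make the assumption tangible, here is a toy version of drawing one classification dataset from an SCM prior: nodes are generated in causal order as sparse random linear functions of their predecessors plus noise, a subset of nodes becomes the observed features, and one held-out node is thresholded into a binary label. All choices here (linear mechanisms, Gaussian noise, median threshold, the function name `sample_scm_dataset`) are illustrative simplifications; the paper's actual prior is far richer.

```python
import random

def sample_scm_dataset(n_samples=100, n_nodes=6, n_features=4, seed=0):
    """Draw one synthetic classification dataset from a toy SCM prior:
    each node is a sparse random linear function of earlier nodes plus
    noise; some nodes are observed as features, one is the label."""
    rng = random.Random(seed)
    # Sparse random weights: node j depends on earlier node i with prob 0.5.
    w = [[rng.gauss(0, 1) if rng.random() < 0.5 else 0.0
          for i in range(j)] for j in range(n_nodes)]
    rows = []
    for _ in range(n_samples):
        vals = []
        for j in range(n_nodes):
            vals.append(sum(w[j][i] * vals[i] for i in range(j)) + rng.gauss(0, 1))
        rows.append(vals)
    feat_idx = rng.sample(range(n_nodes), n_features)
    label_node = rng.choice([i for i in range(n_nodes) if i not in feat_idx])
    X = [[r[i] for i in feat_idx] for r in rows]
    median = sorted(r[label_node] for r in rows)[n_samples // 2]
    y = [int(r[label_node] > median) for r in rows]
    return X, y

X, y = sample_scm_dataset()
```

The domain assumption is exactly that datasets produced by (a vastly more expressive version of) this generator cover real small tabular problems well enough for the trained network to transfer.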
Lean theorems connected to this paper
-
IndisputableMonolith.Cost.FunctionalEquation washburn_uniqueness_aczel — unclear
unclear: relation between the paper passage and the cited Recognition theorem.
TabPFN is a Prior-Data Fitted Network (PFN) and is trained offline once, to approximate Bayesian inference on synthetic datasets drawn from our prior.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 23 Pith papers
-
Privacy Auditing with Zero (0) Training Run
Zero-Run auditing supplies valid lower bounds on differential privacy parameters from fixed member and non-member datasets by modeling and correcting distribution-shift confounding via causal-inference techniques.
-
FORGE: Fragment-Oriented Ranking and Generation for Context-Aware Molecular Optimization
FORGE reformulates molecular optimization as context-aware fragment ranking and replacement using mined low-to-high edit pairs, outperforming larger language models and graph methods on standard benchmarks.
-
Quantifying the Risk-Return Tradeoff in Forecasting
Forecast loss differentials are reframed as returns and assessed with risk-adjusted finance metrics, showing professional forecasters are harder to beat on risk-adjusted performance than on raw accuracy in US macro fo...
-
Data Language Models: A New Foundation Model Class for Tabular Data
Schema-1 is the first Data Language Model that natively understands raw tabular data and outperforms gradient-boosted ensembles, AutoML, and prior tabular foundation models on row-level prediction and imputation tasks.
-
TFM-Retouche: A Lightweight Input-Space Adapter for Tabular Foundation Models
TFM-Retouche is an architecture-agnostic input-space residual adapter that improves tabular foundation model accuracy on 51 datasets by learning input corrections through the frozen backbone, with an identity guard to...
-
PHBench: A Benchmark for Predicting Startup Series A Funding from Product Hunt Launch Signals
PHBench shows Product Hunt launch signals predict Series A funding with an ensemble model reaching AP 0.037 and F0.5 0.097 on blind test data, outperforming logistic regression and zero-shot LLMs.
-
Explainable Load Forecasting with Covariate-Informed Time Series Foundation Models
Time series foundation models match the performance of specialized models for day-ahead load forecasting while providing explanations that match domain knowledge on weather and calendar effects.
-
Selecting Feature Interactions for Generalized Additive Models by Distilling Foundation Models
TabDistill distills feature interactions from tabular foundation models via post-hoc attribution and inserts them into GAMs, yielding consistent predictive gains.
-
Environmental, Social and Governance Sentiment Analysis on Slovene News: A Novel Dataset and Models
The authors release the first Slovene ESG sentiment dataset from news and report that large language models lead on environmental and social classification while fine-tuned SloBERTa performs best on governance.
-
Reciprocal Co-Training (RCT): Coupling Gradient-Based and Non-Differentiable Models via Reinforcement Learning
RCT couples an LLM and Random Forest via RL feedback so each augments the other's features and rewards, producing consistent gains on three medical datasets.
-
TabPFN-3: Technical Report
TabPFN-3 delivers state-of-the-art tabular prediction performance on benchmarks up to 1M rows, is up to 20x faster than prior versions, and introduces test-time scaling that beats non-TabPFN models by hundreds of Elo points.
-
LGB+: A Macroeconomic Forecasting Road Test
LGB+ improves macroeconomic forecasts by letting linear basis functions compete with or alternate against tree updates inside gradient boosting, yielding native linear/nonlinear decomposition of predictions.
-
CarCrashNet: A Large-Scale Dataset and Hierarchical Neural Solver for Data-Driven Structural Crash Simulation
CarCrashNet releases a large-scale open benchmark dataset of structural crash simulations and a hierarchical neural solver for data-driven full-vehicle crash prediction.
-
ModelLens: Finding the Best for Your Task from Myriads of Models
ModelLens learns a performance-aware latent space from 1.62M leaderboard records to rank unseen models on unseen datasets without forward passes on the target.
-
Decoupled PFNs: Identifiable Epistemic-Aleatoric Decomposition via Structured Synthetic Priors
Decoupled PFNs use controllable synthetic priors to train separate latent-signal and noise heads, making epistemic-aleatoric decomposition identifiable and improving acquisition in noisy settings.
-
TFM-Retouche: A Lightweight Input-Space Adapter for Tabular Foundation Models
TFM-Retouche is an input-space residual adapter that lifts TabICLv2 performance by 56 Elo points on 51 tabular datasets while remaining architecture-agnostic and computationally light.
-
Tabular foundation models for in-context prediction of molecular properties
Tabular foundation models achieve high accuracy in molecular property prediction through in-context learning, with up to 100% win rates on MoleculeACE tasks when paired with CheMeleon embeddings.
-
ReSS: Learning Reasoning Models for Tabular Data Prediction via Symbolic Scaffold
ReSS uses decision-tree scaffolds to fine-tune LLMs for faithful tabular reasoning, reporting up to 10% gains over baselines on medical and financial data.
-
From Uniform to Learned Knots: A Study of Spline-Based Numerical Encodings for Tabular Deep Learning
Spline encodings for numerical features show task-dependent performance in tabular deep learning, with piecewise-linear encoding robust for classification and variable results for regression depending on spline family...
-
TabPFN-2.5: Advancing the State of the Art in Tabular Foundation Models
TabPFN-2.5 scales tabular foundation models to 20x larger datasets, outperforms tuned tree models on TabArena, achieves near-perfect win rates against default XGBoost, and adds a distillation engine for fast productio...
-
Evaluating TabPFN for Mild Cognitive Impairment to Alzheimer's Disease Conversion in Data Limited Settings
TabPFN reaches AUC 0.892 for 3-year MCI-to-AD conversion on TADPOLE data and holds performance at N=50 training samples where XGBoost, Random Forest, LightGBM, and logistic regression degrade.
-
Optimizing IoT Intrusion Detection with Tabular Foundation Models for Smart City Forensics
TabPFNv2.5 delivers 40x faster inference than Random Forest at 97% binary accuracy on TON IoT data, enabling a hybrid pipeline for real-time IoT threat screening in smart cities.
-
Noise Immunity in In-Context Tabular Learning: An Empirical Robustness Analysis of TabPFN's Attention Mechanisms
TabPFN maintains high ROC-AUC and structured attention under controlled additions of irrelevant features, nonlinear correlations, and mislabeled targets in binary classification.
Reference graph
Works this paper leans on
- [1] I. Beltagy, M. Peters, and A. Cohan. Longformer: The long-document transformer. arXiv:2004.05150 [cs.CL], 2020.
- [2] V. Borisov, T. Leemann, K. Seßler, J. Haug, M. Pawelczyk, and G. Kasneci. Deep neural networks and tabular data: A survey. arXiv:2110.01889 [cs.LG], 2021.
- [3] T. Brown, B. Mann, N. Ryder, M. Subbiah, et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
- [4] T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.
- [5]
- [6] N. Erickson, J. Mueller, A. Shirkov, H. Zhang, P. Larroy, M. Li, and A. Smola. AutoGluon-Tabular: Robust and accurate AutoML for structured data. arXiv:2003.06505 [stat.ML], 2020.
- [7] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter. Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems (NeurIPS), 2015.
- [8]
- [9] L. Grinsztajn, E. Oyallon, and G. Varoquaux. Why do tree-based models still outperform deep learning on tabular data? arXiv:2207.08815 [cs.LG], 2022.
- [10]
- [11] S. Müller, N. Hollmann, S. Arango, J. Grabocka, and F. Hutter. Transformers can do Bayesian inference. In Proceedings of the International Conference on Learning Representations (ICLR), 2022.
- [12] D. Rothenhäusler, N. Meinshausen, P. Bühlmann, and J. Peters. Anchor regression: Heterogeneous data meets causality. arXiv:1801.06229 [stat.ME], 2018.
- [13] G. Somepalli, M. Goldblum, A. Schwarzschild, C. Bruss, and T. Goldstein. SAINT: Improved neural networks for tabular data via row attention and contrastive pre-training. arXiv:2106.01342 [cs.LG], 2021.
- [14] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
- [15] J. Vanschoren, J. van Rijn, B. Bischl, and L. Torgo. OpenML: Networked science in machine learning. SIGKDD Explorations, 15(2):49–60, 2014.
- [16] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
- [17] Y. Wu, L. Liu, Z. Xie, K.-H. Chow, and W. Wei. Boosting ensemble accuracy by revisiting ensemble diversity metrics. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16464–16472, 2021.
- [18] I. Yeo and R. Johnson. A new family of power transformations to improve normality or symmetry. Biometrika, 87(4):954–959, 2000.
discussion (0)