Why do tree-based models still outperform deep learning on tabular data?
17 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
roles: background 4
citation-polarity summary
polarities: background 4
representative citing papers
citing papers explorer
- TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second
  TabPFN is a Prior-Data Fitted Network that approximates Bayesian inference for small tabular classification: a Transformer is trained once on synthetic data drawn from a causal prior and then solves new tasks in a single forward pass without further updates.
- Learning Dynamic Stability Landscapes in Synchronization Networks
  Introduces graph-to-image prediction of per-node dynamic stability landscapes in oscillator networks from topology, releases two 10k-graph datasets, and shows GNN-CNN models achieve good accuracy with cross-size generalization.
- Data Language Models: A New Foundation Model Class for Tabular Data
  Schema-1 is the first Data Language Model that natively understands raw tabular data and outperforms gradient-boosted ensembles, AutoML, and prior tabular foundation models on row-level prediction and imputation tasks.
- Reciprocal Co-Training (RCT): Coupling Gradient-Based and Non-Differentiable Models via Reinforcement Learning
  RCT couples an LLM and a Random Forest via RL feedback so that each augments the other's features and rewards, producing consistent gains on three medical datasets.
- AXIL: Exact Instance Attribution for Gradient Boosting
  AXIL computes exact fixed-structure instance attributions for squared-error GBMs via a matrix-free O(TN) backward operator, outperforming BoostIn/TREX/LeafInfluence on 20 regression datasets.
- ASD-Bench: A Four-Axis Comprehensive Benchmark of AI Models for Autism Spectrum Disorder
  ASD-Bench evaluates 17 ML and deep learning models on 4,068 AQ-10 records across child, adolescent, and adult cohorts, showing high adult performance, harder adolescent classification, shifting feature importance, and a dissociation between accuracy and calibration.
- Prior-Aligned Data Cleaning for Tabular Foundation Models
  L2C2 is a deep RL framework that learns to clean tabular data by aligning it to the synthetic prior of tabular foundation models, yielding higher accuracy on some benchmarks and cross-dataset policy transfer.
- UniRec: Unified Multimodal Encoding for LLM-Based Recommendations
  UniRec unifies heterogeneous recommendation modalities via specialized encoders, triplet representations, and hierarchical modeling to outperform prior multimodal LLM recommenders by up to 15% on benchmarks.
- Filtering Interlopers with Photometry and Diagnostic Features: A Machine Learning Framework Validated with CSST Slitless Spectroscopy
  An XGBoost classifier filters interlopers in CSST slitless spectroscopy simulations, retaining 42% of galaxies with 96.6% accurate redshifts and 0.13% outliers.
- TabICL: A Tabular Foundation Model for In-Context Learning on Large Data
  TabICL scales in-context learning to large tabular data via column-then-row attention for row embeddings followed by a transformer, matching TabPFNv2 speed and performance while outperforming it and CatBoost on datasets over 10K samples.
- Random-Effects Algorithm for Random Objects in Metric Spaces
  A Fréchet-based random-effects algorithm with M-estimation consistency guarantees is proposed for modeling non-Euclidean random objects in general metric spaces.
- Gradient Boosted Risk Scores
  Gradient boosting produces risk scores with competitive accuracy but 60% fewer rules on classification tasks and 16% fewer on time-to-event tasks than regression-based methods like AutoScore.
- A Nationwide Japanese Medical Claims Foundation Model: Balancing Model Scaling and Task-Specific Computational Efficiency
  Scaling experiments on structured medical claims data show task-dependent saturation: disease incidence prediction benefits from models up to 101M parameters while medication prediction saturates at 11M, with all models outperforming a LightGBM baseline.
- Who Audits the Auditor? Tamper-Proof Fraud Detection with Blockchain-Anchored Explainable ML
  A blockchain-anchored explainable ML system delivers tamper-evident fraud detection with F1 of 0.895 and sub-25ms latency on Layer-2 networks.
- Kitchen Sink Anomaly Detection
  A combined kitchen-sink observable set of Energy Flow Polynomials and subjettiness variables outperforms standard baselines in sensitivity to a wide range of resonant signals, with new public benchmarks released and an attribute bagging variant reducing training cost.
- Cooperative Coevolution versus Monolithic Evolutionary Search for Semi-Supervised Tabular Classification
  Cooperative coevolution and monolithic evolution achieve similar performance gains over baselines in low-label semi-supervised tabular classification.
- Optimizing IoT Intrusion Detection with Tabular Foundation Models for Smart City Forensics
  TabPFNv2.5 delivers 40x faster inference than Random Forest at 97% binary accuracy on TON_IoT data, enabling a hybrid pipeline for real-time IoT threat screening in smart cities.