A new corpus of 108 mixed string-numeric tables shows that advanced tabular learners with basic string embeddings perform well on most real-world data, while large LLM encoders help on free-text-heavy tables.
arXiv preprint arXiv:2407.00956
15 Pith papers cite this work.
representative citing papers
TabArena launches a dynamic, updatable benchmarking system for tabular ML. It shows that boosted trees remain competitive, that deep learning matches them under larger budgets with ensembling, that foundation models excel on small data, and that cross-model ensembles advance SOTA, while also flagging validation overfitting.
Tabular foundation models excel on tiny- to medium-sized IID data but are outperformed by traditional tree-based and deep learning models on non-IID, large, and high-dimensional datasets, based on evaluations across 11 models and 142 datasets in the new BeyondArena benchmark.
FlexTab shows a shared encoder with task-specific decoders trained on unlabeled tables can achieve SOTA on classification, regression, anomaly detection and entity matching while staying competitive on relational entity classification.
MulTaBench is a new collection of 40 image-tabular and text-tabular datasets designed to test target-aware representation tuning in multimodal tabular models.
TabDistill distills feature interactions from tabular foundation models via post-hoc attribution and inserts them into GAMs, yielding consistent predictive gains.
A Lagrangian duality method approximates best responses for non-linear strategic classification and enables gradient-based training via the Implicit Function Theorem, yielding improved strategic accuracy on standard datasets.
PACE-GGM selects poorly approximated covariance entries, measures them privately, and reconstructs the full matrix with a maximum-entropy objective to produce a Gaussian graphical model, yielding lower estimation error than uniform perturbation.
BoostLLM trains sequential PEFT adapters in a boosting framework with tree path inputs to improve LLM performance on few-shot tabular classification, matching or exceeding XGBoost.
L2C2 is a deep RL framework that learns to clean tabular data by aligning it to the synthetic prior of tabular foundation models, yielding higher accuracy on some benchmarks and cross-dataset policy transfer.
MachineLearningLM uses continued pretraining on SCM-synthesized ML tasks with random-forest distillation to give LLMs robust many-shot in-context learning on tabular classification, reaching random-forest accuracy levels while preserving general chat performance.
xRFM merges kernel-based feature learning with tree structures for scalable, interpretable tabular modeling and reports top performance on 100 regression datasets and competitive results on 200 classification datasets against 31 baselines, including GBDTs and TabPFNv2.
Tabular foundation models outperform standard methods in credit risk PD and LGD tasks, with larger gains on smaller datasets when used out-of-the-box.
A benchmark finds that some deep learning models match gradient-boosted trees on LIGO glitch classification with fewer parameters, with feature importance partially consistent across architectures.
A data-centric AI framework cleans FLIm labels via confident learning and achieves 96% accuracy classifying glioma infiltration into low, moderate, and high cellularity.
citing papers explorer
Non-Linear Strategic Classification Made Practical