arXiv preprint arXiv:1708.03731
10 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
- MacrOData: New Benchmarks of Thousands of Datasets for Tabular Outlier Detection
MacrOData supplies three large, curated benchmark suites totaling 2,446 datasets for tabular outlier detection, complete with standardized splits, metadata, and a public leaderboard.
- TabArena: A Living Benchmark for Machine Learning on Tabular Data
TabArena launches a dynamic, updatable benchmarking system for tabular ML that shows boosted trees remain competitive, deep learning matches them under larger budgets with ensembling, foundation models excel on small data, and cross-model ensembles advance SOTA while flagging validation overfitting.
- Data Language Models: A New Foundation Model Class for Tabular Data
Schema-1 is the first Data Language Model that natively understands raw tabular data and outperforms gradient-boosted ensembles, AutoML, and prior tabular foundation models on row-level prediction and imputation tasks.
- TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding
TabEmbed is the first generalist embedding model for tabular data that unifies classification and retrieval in one space via contrastive learning and outperforms text embedding models on the new TabBench benchmark.
- Ternary Decision Trees with Locally-Adaptive Uncertainty Zones
Ternary decision trees with locally-adaptive uncertainty zones estimated from CART statistics improve decided accuracy over standard trees by blending boundary predictions and flagging uncertain cases.
- Shaping the Prior: How Synthetic Task Distributions Determine Tabular Foundation Model Quality
O'Prior, a compositional synthetic prior with hierarchical SCMs, realism engines, stress modules, and curriculum protocols, improves tabular foundation model accuracy and robustness on real benchmarks when architecture and compute are held fixed.
- Active Tabular Augmentation via Policy-Guided Diffusion Inpainting
TAP couples a learner-conditioned policy with diffusion inpainting to generate and selectively inject high-utility tabular augmentations, yielding up to 15.6 pp accuracy gains and 32% RMSE reduction on seven datasets under severe scarcity.
- Prior-Aligned Data Cleaning for Tabular Foundation Models
L2C2 is a deep RL framework that learns to clean tabular data by aligning it to the synthetic prior of tabular foundation models, yielding higher accuracy on some benchmarks and cross-dataset policy transfer.
- Two-stage Optimization for Machine Learning Workflow
Two-stage optimization for ML workflows that prioritizes data pipeline search over hyperparameter tuning, with time-allocation policies and a specificity metric for pruning.
- Mitigating Label Shift in Tabular In-Context Learning via Test-Time Posterior Adjustment