ScoringBench: A Benchmark for Evaluating Tabular Foundation Models with Proper Scoring Rules
Pith reviewed 2026-05-13 23:16 UTC · model grok-4.3
The pith
Proper scoring rules reorder which tabular foundation models rank highest.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Evaluating several models spanning in-context learners, fine-tuned foundation models, gradient-boosted trees, and MLPs, we find that model rankings shift substantially depending on the scoring rule: models that excel on point-estimate metrics can rank poorly on probabilistic ones, and the top-performing model under one proper scoring rule may rank noticeably lower under another.
What carries the argument
ScoringBench, an extensible benchmark that scores full predictive distributions from tabular regression models with proper scoring rules on 97 datasets and supplies both ordinal and z-score ranking protocols.
If this is right
- Model selection for domains where tail errors are costly should incorporate probabilistic scoring rules rather than default to RMSE or R-squared.
- In-context learners and fine-tuned foundation models can trade places with gradient-boosted trees depending on the chosen scoring rule.
- The two ranking protocols (Demsar/autorank and z-score) can produce different orderings even under the same scoring rule (see the sketch after this list).
- Community extensions via the git-based leaderboard can add domain-specific weighted scoring rules to reflect application costs.
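A minimal sketch of how the two aggregation protocols could be computed from a per-dataset score table; the paper's exact normalisation, tie handling, and autorank configuration are not quoted above, so the function names, column names, and toy data below are illustrative rather than ScoringBench's actual API.

```python
import numpy as np
import pandas as pd

def zscore_ranking(scores: pd.DataFrame) -> pd.Series:
    """Magnitude-preserving aggregation: average per-dataset z-scores.

    `scores` is a datasets x models table of one lower-is-better metric
    (e.g. mean CRPS per dataset). Standardising within each dataset puts
    datasets with very different target scales on a common footing while
    keeping the size of the gaps between models, unlike purely ordinal ranks.
    """
    z = scores.sub(scores.mean(axis=1), axis=0).div(scores.std(axis=1), axis=0)
    return z.mean(axis=0).sort_values()  # lower mean z-score = better model

def ordinal_ranking(scores: pd.DataFrame) -> pd.Series:
    """Ordinal (Demsar-style) aggregation: average per-dataset ranks.

    The autorank package additionally runs the Friedman test and post-hoc
    analysis; this sketch reproduces only the rank-averaging step.
    """
    return scores.rank(axis=1, method="average").mean(axis=0).sort_values()

# Toy usage: rows = 97 datasets, columns = illustrative model names.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.gamma(2.0, 1.0, size=(97, 4)),
                   columns=["TabPFN", "TabICL", "GBT", "MLP"])
print(zscore_ranking(toy))
print(ordinal_ranking(toy))
```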
Where Pith is reading between the lines
- Benchmarks that previously ranked tabular models on point metrics alone may need re-running with proper scoring rules before deployment decisions are made.
- Model developers could add proper scoring rules directly to training objectives to improve performance on the metrics that matter for high-stakes use.
- Applications with known asymmetric loss functions could derive custom weighted CRPS variants from the benchmark infrastructure.
Load-bearing premise
The 97 datasets and the five chosen proper scoring rules are representative enough to support general statements about which models are preferable when error costs are asymmetric.
What would settle it
Stable model rankings across point-estimate metrics and proper scoring rules on a fresh collection of datasets would falsify the claim that rankings shift substantially.
read the original abstract
Tabular foundation models such as TabPFN and TabICL already produce full predictive distributions, yet prevailing regression benchmarks evaluate them almost exclusively via point-estimate metrics (RMSE, $R^2$). This discards precisely the distributional information these models are designed to provide - a critical gap for high-stakes domains where not all kinds of errors are equally costly. We introduce ScoringBench, an open and extensible benchmark that evaluates tabular regression models under a comprehensive suite of proper scoring rules - including CRPS, CRLS, interval score, energy score, and weighted CRPS - alongside standard point metrics. ScoringBench covers 97 regression datasets from diverse domains, supports transparent community contributions via a git-based leaderboard, and provides two complementary ranking protocols: an ordinal Demsar/autorank approach and a magnitude-preserving z-score ranking approach. Evaluating several models - spanning in-context learners, fine-tuned foundation models, gradient-boosted trees, and MLPs - we find that model rankings shift substantially depending on the scoring rule: models that excel on point-estimate metrics can rank poorly on probabilistic ones, and the top-performing model under one proper scoring rule may rank noticeably lower under another. These results demonstrate that the choice of evaluation metric is not a technicality but a modelling decision - and, for applications where e.g. tail errors are disproportionately costly, a domain-specific requirement with direct consequences for model deployment.
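For orientation, three of the scoring rules named in the abstract have standard textbook definitions; the paper's exact formulations (normalisation constants, interval level $\alpha$, and the CRLS and weighted-CRPS variants) are not quoted here, so the following are the usual forms rather than ScoringBench's precise implementations. For a predictive CDF $F$, observation $y$, central $(1-\alpha)$ prediction interval $[l, u]$, and independent samples $X, X' \sim F$:
\[
\mathrm{CRPS}(F, y) = \int_{-\infty}^{\infty} \bigl(F(x) - \mathbf{1}\{y \le x\}\bigr)^{2}\, dx,
\]
\[
\mathrm{IS}_{\alpha}(l, u; y) = (u - l) + \tfrac{2}{\alpha}\,(l - y)\,\mathbf{1}\{y < l\} + \tfrac{2}{\alpha}\,(y - u)\,\mathbf{1}\{y > u\},
\]
\[
\mathrm{ES}(F, y) = \mathbb{E}\,\lVert X - y \rVert - \tfrac{1}{2}\,\mathbb{E}\,\lVert X - X' \rVert .
\]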
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ScoringBench, an open benchmark for tabular regression that evaluates models (including TabPFN, TabICL, gradient-boosted trees, and MLPs) on 97 datasets using proper scoring rules (CRPS, CRLS, interval score, energy score, weighted CRPS) in addition to point metrics. It reports that model rankings shift substantially across scoring rules, with point-estimate leaders often ranking lower under probabilistic metrics, and provides ordinal and z-score ranking protocols plus a git-based leaderboard.
Significance. If the reported ranking shifts are correctly computed, the work supplies concrete evidence that evaluation metric choice is a substantive modeling decision with direct consequences for high-stakes deployment. The emphasis on proper scoring rules for distribution-outputting foundation models fills a documented gap, and the extensible leaderboard format supports reproducibility and community extension.
major comments (2)
- [§4.3] §4.3 and Table 4: the z-score aggregation across the 97 datasets does not report per-metric standard errors or dataset-level variance; without these, it is unclear whether the reported rank reversals exceed sampling variability and therefore whether the 'substantial shift' claim is robust to dataset subsampling.
- [§3.2] §3.2: the parameterization of the weighted CRPS (weight function, tail emphasis) is not fully specified; different choices can alter which models appear superior, so the sensitivity of the main ranking results to this hyper-parameter should be shown.
minor comments (2)
- [Abstract] The abstract states that two ranking protocols are provided but does not name them; adding the names (Demsar/autorank and z-score) would improve immediate clarity.
- [§3] A compact summary table listing each proper scoring rule, its formula, and its sensitivity properties would help readers compare the metrics without consulting external references.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and the recommendation for minor revision. We address each major comment below.
read point-by-point responses
- Referee: [§4.3] §4.3 and Table 4: the z-score aggregation across the 97 datasets does not report per-metric standard errors or dataset-level variance; without these, it is unclear whether the reported rank reversals exceed sampling variability and therefore whether the 'substantial shift' claim is robust to dataset subsampling.
Authors: We agree that reporting standard errors would strengthen the robustness of the rank-reversal claims. In the revised manuscript we will add per-metric standard errors (computed via bootstrap resampling over the 97 datasets) to the z-score results in §4.3 and Table 4. This will allow readers to evaluate whether the observed shifts exceed sampling variability. revision: yes
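A sketch of the kind of dataset-level bootstrap this response points to: resample the 97 datasets with replacement and recompute each model's aggregate z-score, then report the spread. The authors' actual replicate count, data layout, and whether they bootstrap ranks or z-scores are not specified, so everything below is illustrative.

```python
import numpy as np
import pandas as pd

def bootstrap_zscore_se(scores: pd.DataFrame, n_boot: int = 2000, seed: int = 0) -> pd.DataFrame:
    """Standard errors of per-model mean z-scores under dataset resampling.

    `scores` is a datasets x models table of one scoring rule. Each bootstrap
    replicate redraws the datasets (rows) with replacement, keeps the within-
    dataset standardisation fixed, and records the per-model mean z-score.
    """
    rng = np.random.default_rng(seed)
    z = scores.sub(scores.mean(axis=1), axis=0).div(scores.std(axis=1), axis=0)
    reps = np.empty((n_boot, z.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, len(z), size=len(z))  # resampled dataset indices
        reps[b] = z.values[idx].mean(axis=0)
    return pd.DataFrame({"mean_z": z.mean(axis=0).values,
                         "se": reps.std(axis=0, ddof=1)},
                        index=z.columns)
```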
- Referee: [§3.2] §3.2: the parameterization of the weighted CRPS (weight function, tail emphasis) is not fully specified; different choices can alter which models appear superior, so the sensitivity of the main ranking results to this hyper-parameter should be shown.
Authors: We thank the referee for this observation. We will fully specify the exact weight function (linear tail emphasis with w(u) = 2u for the lower tail and w(u) = 2(1-u) for the upper tail) in §3.2. In addition, we will include a short sensitivity analysis (in the appendix) showing how the main rankings change under alternative weight functions (uniform and quadratic tail emphasis) to demonstrate that the reported shifts are not driven by this particular choice. revision: yes
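A sketch of a quantile-weighted CRPS with a pluggable weight function, in the spirit of the tail-emphasis weights named in this response. Whether ScoringBench weights in quantile space or threshold space, and how it discretises the integral, is not quoted here, so the grid, the helper names, and the example weights below are illustrative only.

```python
import numpy as np

def quantile_weighted_crps(quantile_fn, y, weight_fn, n_levels=99):
    """Approximate the quantile-weighted CRPS on an evenly spaced level grid.

    quantile_fn(alpha) returns the predictive quantile F^{-1}(alpha) for a
    single observation y; weight_fn(alpha) is the emphasis function. With
    weight_fn(alpha) == 1 this recovers (an approximation of) the plain CRPS,
    since CRPS equals the integral of the quantile score over alpha in (0, 1).
    """
    alphas = np.linspace(0.0, 1.0, n_levels + 2)[1:-1]    # avoid 0 and 1 exactly
    q = np.asarray([quantile_fn(a) for a in alphas])
    pinball = (y - q) * (alphas - (y < q).astype(float))  # classic pinball loss
    return np.mean(2.0 * weight_fn(alphas) * pinball)     # Riemann average over (0, 1)

# Example weights mirroring the linear tail emphasis described in the response.
lower_tail_weight = lambda a: 2.0 * a
upper_tail_weight = lambda a: 2.0 * (1.0 - a)
```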
Circularity Check
No significant circularity
full rationale
The paper presents an empirical benchmark evaluation using externally defined proper scoring rules (CRPS, CRLS, interval score, energy score, weighted CRPS) applied to public datasets. Model rankings are computed directly from held-out performance metrics rather than derived from any internal equations or fitted parameters that would reduce the result to its own inputs. No self-citations are load-bearing for the central claim, and the observed ranking shifts are verifiable observations without self-definitional or ansatz-smuggling steps. The derivation chain is self-contained and consists of transparent computation on independent data.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Proper scoring rules are the appropriate metric family for evaluating predictive distributions in regression.
- domain assumption: The 97 regression datasets are sufficiently diverse and representative for drawing general conclusions about tabular model performance.
Reference graph
Works this paper leans on
- [1] OpenML benchmark suites. We include three curated regression suites hosted on OpenML [Vanschoren et al., 2014]: Suite 297 (OpenML-CTR23), a community-curated tabular regression benchmark [Fischer et al., 2023]; Suite 299, an additional OpenML regression collection; and Suite 269, an OpenML regression collection collating further community-vetted datasets...
- [2] PMLB (Penn Machine Learning Benchmarks). Regression datasets from PMLB [Olson et al., 2017], distributed as compressed TSV files on GitHub. PMLB provides a standardised, version-controlled archive of datasets widely used in AutoML research.
- [3] KEEL repository. Regression datasets from the KEEL data-mining repository [Derrac et al., 2015], distributed as zipped .dat files. KEEL datasets add coverage of domains less represented in OpenML and PMLB, such as financial time series and energy consumption.
- [4] TALENT / OpenML verified regression datasets. Additional OpenML datasets drawn from the TALENT benchmark [Liu et al., 2025] and independently verified as regression tasks. This set expands domain coverage to areas including real estate, book reviews, sensor fusion, and sports analytics.
- [5] Exact normalised match. Dataset names are normalised by lowercasing, stripping leading numeric prefixes (e.g., 197_cpu_act → cpuact), removing separators and special characters, and collapsing repeated characters. Datasets whose normalised names are identical are treated as duplicates.
- [6] Substring match. For names longer than three characters, we check whether either normalised name is a substring of the other. This catches common patterns such as houses matching californiahousing.
- [7] Fuzzy match. We compute a pairwise similarity ratio (via Python's SequenceMatcher) and flag pairs with ≥ 85% similarity as duplicates.
- [8] Explicit deduplication keys. Many datasets are annotated with known aliases (e.g., cpu_act ↔ cpuact). All checks above are re-applied against these keys. When a duplicate is detected, we retain only one version and discard the others, using the (arbitrary) precedence order OpenML > PMLB > KEEL > scikit-learn.
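Entries [5]-[8] describe a name-based deduplication cascade; a compact sketch of those checks follows. The exact normalisation rules, thresholds beyond the quoted 85%, and the alias table are only partially quoted, so the regexes and examples below are approximations.

```python
import re
from difflib import SequenceMatcher

def normalise(name: str) -> str:
    """Lowercase, strip a leading numeric prefix, drop separators and special
    characters, and collapse repeated characters (e.g. '197_cpu_act' -> 'cpuact')."""
    s = name.lower()
    s = re.sub(r"^\d+[\s_\-]*", "", s)   # leading numeric prefix
    s = re.sub(r"[^a-z0-9]", "", s)      # separators and special characters
    s = re.sub(r"(.)\1+", r"\1", s)      # collapse repeated characters
    return s

def is_duplicate(a: str, b: str, threshold: float = 0.85) -> bool:
    na, nb = normalise(a), normalise(b)
    if na == nb:                                           # exact normalised match
        return True
    if len(na) > 3 and len(nb) > 3 and (na in nb or nb in na):
        return True                                        # substring match
    return SequenceMatcher(None, na, nb).ratio() >= threshold  # fuzzy match

print(is_duplicate("197_cpu_act", "cpuact"))           # True: exact normalised match
print(is_duplicate("housing", "california_housing"))   # True: substring match
```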
- [9] Add your custom wrapper with a unique name to the folder scoringbench/wrapper/, following the template of existing wrappers. Ensure it adheres to the expected interface for seamless integration.
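The expected wrapper interface is not reproduced in the quoted text, so the following is a purely hypothetical shape for a file in scoringbench/wrapper/; the class name, method names, and quantile-based output are assumptions, not the repository's actual API.

```python
# scoringbench/wrapper/my_model_wrapper.py  (hypothetical file name and interface)
import numpy as np

class MyModelWrapper:
    """Illustrative wrapper: fit on a tabular regression task and expose a
    predictive distribution (here via quantiles) so that CRPS-style scoring
    rules can be computed downstream."""

    name = "my-model"  # unique name, assumed to identify runs on the leaderboard

    def fit(self, X_train: np.ndarray, y_train: np.ndarray) -> "MyModelWrapper":
        # Placeholder "model": remember the empirical distribution of y_train.
        self._y_train = np.asarray(y_train, dtype=float)
        return self

    def predict_quantiles(self, X_test: np.ndarray, levels: np.ndarray) -> np.ndarray:
        # Placeholder predictive distribution: the same marginal quantiles for
        # every test row; a real wrapper would call the underlying model here.
        q = np.quantile(self._y_train, levels)
        return np.tile(q, (len(X_test), 1))
```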
- [10] Run the benchmark via python run_bench_regression.py (or parallelize via Slurm with sbatch array=0103 run_benchmark.sbatch).
- [11] Run the ranking method by executing python autorank_leaderboard.py, which will evaluate both Autorank and z-score ranking and generate the leaderboard outputs.
- [12] Commit your model parquet file (documenting each run) and the updated leaderboard CSVs in /output/figures/leaderboard/. Since the output repository is separate from the main repository, push to both. This serves as a public ledger and allows traceability.
- [13] Create a pull request to the ScoringBench repository for review; contributions that meet standards will be merged.
- [14] Upon merge, https://scoringbench.com/ will automatically display the updated leaderboard, and the data is archived in the Git LFS repository for reproducibility.
discussion (0)