pith. machine review for the scientific record.

arxiv: 2603.29928 · v3 · submitted 2026-03-31 · 💻 cs.AI

Recognition: no theorem link

ScoringBench: A Benchmark for Evaluating Tabular Foundation Models with Proper Scoring Rules

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 23:16 UTC · model grok-4.3

classification 💻 cs.AI
keywords: tabular foundation models · proper scoring rules · benchmark · regression · CRPS · predictive distributions · model evaluation · model ranking

The pith

Proper scoring rules reorder which tabular foundation models rank highest.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Tabular foundation models output full predictive distributions, yet standard benchmarks score them only with point estimates such as RMSE. ScoringBench applies a suite of proper scoring rules, including CRPS, CRLS, interval score, energy score, and weighted CRPS, to 97 regression datasets. The evaluation shows that model rankings shift substantially with the choice of rule: models strong on point metrics often rank lower on probabilistic ones, and the top model under one rule can drop under another. This matters because high-stakes applications assign unequal costs to different error types. The benchmark supplies two ranking methods and an open leaderboard for further contributions.
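To make the contrast with point metrics concrete, here is a minimal sketch of the generic sample-based CRPS estimator from the scoring-rule literature (not necessarily the benchmark's own implementation). Two forecasts with the same mean, and therefore the same point error, get very different CRPS once spread is taken into account:

    import numpy as np

    def crps_from_samples(samples, y):
        """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'| (lower is better).

        `samples` are draws from one predictive distribution; for a point
        forecast (all samples equal) this reduces to the absolute error.
        """
        s = np.asarray(samples, dtype=float)
        return np.mean(np.abs(s - y)) - 0.5 * np.mean(np.abs(s[:, None] - s[None, :]))

    # Same mean prediction (0.0), so identical error on the mean; different spread.
    rng = np.random.default_rng(0)
    y_true = 2.0
    sharp = rng.normal(0.0, 0.1, size=2000)  # confident but biased
    wide = rng.normal(0.0, 2.0, size=2000)   # uncertain, but covers the truth
    print(crps_from_samples(sharp, y_true))  # ~1.94
    print(crps_from_samples(wide, y_true))   # ~1.2, better under CRPS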

Core claim

Evaluating several models spanning in-context learners, fine-tuned foundation models, gradient-boosted trees, and MLPs, we find that model rankings shift substantially depending on the scoring rule: models that excel on point-estimate metrics can rank poorly on probabilistic ones, and the top-performing model under one proper scoring rule may rank noticeably lower under another.

What carries the argument

ScoringBench, an extensible benchmark that scores full predictive distributions from tabular regression models with proper scoring rules on 97 datasets and supplies both ordinal and z-score ranking protocols.
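A hedged illustration of why the two protocols are both worth reporting, assuming only a datasets × models matrix of a lower-is-better metric; the benchmark's own implementation (autorank for the ordinal route, its own z-score aggregation) may differ in details:

    import numpy as np

    def ordinal_and_zscore_rankings(scores):
        """scores: (n_datasets, n_models) matrix of a lower-is-better metric.

        Ordinal route: rank models within each dataset (1 = best), then average.
        Z-score route: standardise scores within each dataset, then average;
        this preserves the magnitude of gaps, so the two can disagree.
        """
        ranks = np.argsort(np.argsort(scores, axis=1), axis=1) + 1
        z = (scores - scores.mean(axis=1, keepdims=True)) / scores.std(axis=1, keepdims=True)
        return ranks.mean(axis=0), z.mean(axis=0)

    # Toy case: model A wins narrowly on most datasets but fails badly on one.
    scores = np.array([
        [1.00, 1.01, 2.00],
        [1.00, 1.01, 2.00],
        [1.00, 1.01, 2.00],
        [3.00, 1.00, 2.00],
    ])
    print(ordinal_and_zscore_rankings(scores))
    # Mean ranks put A first (1.5 vs 1.75 vs 2.75); mean z-scores put B first.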

If this is right

  • Model selection for domains where tail errors are costly should incorporate probabilistic scoring rules rather than default to RMSE or R-squared.
  • In-context learners and fine-tuned foundation models can trade places with gradient-boosted trees depending on the chosen scoring rule.
  • The two ranking protocols (Demsar/autorank and z-score) can produce different orderings even under the same scoring rule.
  • Community extensions via the git-based leaderboard can add domain-specific weighted scoring rules to reflect application costs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Benchmarks that previously ranked tabular models on point metrics alone may need re-running with proper scoring rules before deployment decisions are made.
  • Model developers could add proper scoring rules directly to training objectives to improve performance on the metrics that matter for high-stakes use (see the sketch after this list).
  • Applications with known asymmetric loss functions could derive custom weighted CRPS variants from the benchmark infrastructure.
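On the second bullet, which is an editorial extension rather than anything the paper does: for a model with a Gaussian output head there is a standard closed-form CRPS that can stand in for MSE as a training loss. A minimal sketch in NumPy/SciPy; the same expression ports directly to an autodiff framework:

    import numpy as np
    from scipy.stats import norm

    def gaussian_crps(mu, sigma, y):
        """Closed-form CRPS of a Gaussian predictive distribution N(mu, sigma^2):
        CRPS = sigma * (z * (2*Phi(z) - 1) + 2*phi(z) - 1/sqrt(pi)), z = (y - mu)/sigma.

        Averaged over a batch, this is a drop-in probabilistic loss for a model
        that outputs a mean and a standard deviation per example.
        """
        z = (y - mu) / sigma
        return sigma * (z * (2.0 * norm.cdf(z) - 1.0) + 2.0 * norm.pdf(z) - 1.0 / np.sqrt(np.pi))

    # Calibration is rewarded: same mean error, different claimed spread.
    print(gaussian_crps(mu=0.0, sigma=0.1, y=2.0))  # overconfident, ~1.94
    print(gaussian_crps(mu=0.0, sigma=2.0, y=2.0))  # honest spread, ~1.2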

Load-bearing premise

The 97 datasets and the five chosen proper scoring rules are representative enough to support general statements about which models are preferable when error costs are asymmetric.

What would settle it

Stable model rankings across point-estimate metrics and proper scoring rules on a fresh collection of datasets would falsify the claim that rankings shift substantially.

Figures

Figures reproduced from arXiv: 2603.29928 by Jonas Landsgesell, Pascal Knoll, Tizian Wenzel.

Figure 1: ScoringBench: Evaluating tabular regression models with proper scoring rules.
Figure 2: Ranking heatmaps summarizing the performance of different models on different scoring rules.
Figure 3: Critical Difference (CD) diagram for CRPS: models are positioned by autorank on a horizontal axis.
Figure 5: Ablation study on dataset size; the benchmark is re-run with a reduced, capped dataset size.
read the original abstract

Tabular foundation models such as TabPFN and TabICL already produce full predictive distributions, yet prevailing regression benchmarks evaluate them almost exclusively via point-estimate metrics (RMSE, $R^2$). This discards precisely the distributional information these models are designed to provide - a critical gap for high-stakes domains where not all kinds of errors are equally costly. We introduce ScoringBench, an open and extensible benchmark that evaluates tabular regression models under a comprehensive suite of proper scoring rules - including CRPS, CRLS, interval score, energy score, and weighted CRPS - alongside standard point metrics. ScoringBench covers 97 regression datasets from diverse domains, supports transparent community contributions via a git-based leaderboard, and provides two complementary ranking protocols: an ordinal Demsar/autorank approach and a magnitude-preserving z-score ranking approach. Evaluating several models - spanning in-context learners, fine-tuned foundation models, gradient-boosted trees, and MLPs - we find that model rankings shift substantially depending on the scoring rule: models that excel on point-estimate metrics can rank poorly on probabilistic ones, and the top-performing model under one proper scoring rule may rank noticeably lower under another. These results demonstrate that the choice of evaluation metric is not a technicality but a modelling decision - and, for applications where e.g. tail errors are disproportionately costly, a domain-specific requirement with direct consequences for model deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces ScoringBench, an open benchmark for tabular regression that evaluates models (including TabPFN, TabICL, gradient-boosted trees, and MLPs) on 97 datasets using proper scoring rules (CRPS, CRLS, interval score, energy score, weighted CRPS) in addition to point metrics. It reports that model rankings shift substantially across scoring rules, with point-estimate leaders often ranking lower under probabilistic metrics, and provides ordinal and z-score ranking protocols plus a git-based leaderboard.

Significance. If the reported ranking shifts are correctly computed, the work supplies concrete evidence that evaluation metric choice is a substantive modeling decision with direct consequences for high-stakes deployment. The emphasis on proper scoring rules for distribution-outputting foundation models fills a documented gap, and the extensible leaderboard format supports reproducibility and community extension.

major comments (2)
  1. [§4.3] §4.3 and Table 4: the z-score aggregation across the 97 datasets does not report per-metric standard errors or dataset-level variance; without these, it is unclear whether the reported rank reversals exceed sampling variability and therefore whether the 'substantial shift' claim is robust to dataset subsampling.
  2. [§3.2] §3.2: the parameterization of the weighted CRPS (weight function, tail emphasis) is not fully specified; different choices can alter which models appear superior, so the sensitivity of the main ranking results to this hyper-parameter should be shown.
minor comments (2)
  1. [Abstract] The abstract states that two ranking protocols are provided but does not name them; adding the names (Demsar/autorank and z-score) would improve immediate clarity.
  2. [§3] A compact summary table listing each proper scoring rule, its formula, and its sensitivity properties would help readers compare the metrics without consulting external references.
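For reference against the second minor comment, the standard textbook forms of three of the rules (the paper's exact parameterizations, in particular of CRLS and the weighted CRPS, may differ), with X, X' independent draws from the predictive distribution F, y the realized outcome, and [l, u] the central (1 − α) prediction interval:

    % Standard textbook forms from the scoring-rule literature, not the paper's own table.
    \begin{align*}
    \mathrm{CRPS}(F, y) &= \mathbb{E}\lvert X - y\rvert - \tfrac{1}{2}\,\mathbb{E}\lvert X - X'\rvert,\\
    \mathrm{IS}_\alpha(l, u; y) &= (u - l) + \tfrac{2}{\alpha}\,(l - y)\,\mathbf{1}\{y < l\} + \tfrac{2}{\alpha}\,(y - u)\,\mathbf{1}\{y > u\},\\
    \mathrm{ES}(F, \mathbf{y}) &= \mathbb{E}\lVert \mathbf{X} - \mathbf{y}\rVert - \tfrac{1}{2}\,\mathbb{E}\lVert \mathbf{X} - \mathbf{X}'\rVert.
    \end{align*}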

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation for minor revision. We address each major comment below.

read point-by-point responses
  1. Referee: [§4.3] §4.3 and Table 4: the z-score aggregation across the 97 datasets does not report per-metric standard errors or dataset-level variance; without these, it is unclear whether the reported rank reversals exceed sampling variability and therefore whether the 'substantial shift' claim is robust to dataset subsampling.

    Authors: We agree that reporting standard errors would strengthen the robustness of the rank-reversal claims. In the revised manuscript we will add per-metric standard errors (computed via bootstrap resampling over the 97 datasets) to the z-score results in §4.3 and Table 4. This will allow readers to evaluate whether the observed shifts exceed sampling variability. revision: yes

  2. Referee: [§3.2] §3.2: the parameterization of the weighted CRPS (weight function, tail emphasis) is not fully specified; different choices can alter which models appear superior, so the sensitivity of the main ranking results to this hyper-parameter should be shown.

    Authors: We thank the referee for this observation. We will fully specify the exact weight function (linear tail emphasis with w(u) = 2u for the lower tail and w(u) = 2(1-u) for the upper tail) in §3.2. In addition, we will include a short sensitivity analysis (in the appendix) showing how the main rankings change under alternative weight functions (uniform and quadratic tail emphasis) to demonstrate that the reported shifts are not driven by this particular choice. revision: yes
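Two hedged sketches of what the promised revisions could look like, assuming per-dataset results are already available as NumPy arrays; neither is the authors' actual code. First, bootstrap standard errors for the z-score aggregation (point 1), resampling datasets with replacement:

    import numpy as np

    def bootstrap_zscore_se(z_scores, n_boot=10000, seed=0):
        """z_scores: (n_datasets, n_models) per-dataset standardized scores.

        Resampling rows (datasets) with replacement approximates how much each
        model's aggregate z-score, and hence the ranking, could move under a
        different draw of benchmark tasks. Returns one standard error per model.
        """
        rng = np.random.default_rng(seed)
        n_datasets, n_models = z_scores.shape
        boot_means = np.empty((n_boot, n_models))
        for b in range(n_boot):
            idx = rng.integers(0, n_datasets, size=n_datasets)
            boot_means[b] = z_scores[idx].mean(axis=0)
        return boot_means.std(axis=0)

Second, a quantile-weighted CRPS via the standard quantile-score decomposition (point 2); the weight functions named in the response slot in as `weight`, and a constant weight of 1 recovers plain CRPS up to discretization error:

    def weighted_crps_from_samples(samples, y, weight, n_grid=99):
        """Quantile-weighted CRPS: integral over a in (0, 1) of w(a) * QS_a,
        where QS_a(q, y) = 2 * (1{y < q} - a) * (q - y) and q = F^{-1}(a),
        approximated on a grid of quantile levels from predictive samples.
        """
        alphas = (np.arange(n_grid) + 0.5) / n_grid
        q = np.quantile(np.asarray(samples, dtype=float), alphas)
        qs = 2.0 * ((y < q).astype(float) - alphas) * (q - y)
        return float(np.mean(weight(alphas) * qs))

    # e.g. one of the tail-emphasising choices named in the response:
    # weighted_crps_from_samples(samples, y, weight=lambda a: 2.0 * a)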

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents an empirical benchmark evaluation using externally defined proper scoring rules (CRPS, CRLS, interval score, energy score, weighted CRPS) applied to public datasets. Model rankings are computed directly from held-out performance metrics rather than derived from any internal equations or fitted parameters that would reduce the result to its own inputs. No self-citations are load-bearing for the central claim, and the observed ranking shifts are verifiable observations without self-definitional or ansatz-smuggling steps. The derivation chain is self-contained and consists of transparent computation on independent data.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper introduces no new mathematical entities or fitted parameters. It relies on the standard definition of proper scoring rules and the assumption that the collected datasets are representative. No free parameters are introduced to produce the reported rankings.

axioms (2)
  • domain assumption Proper scoring rules are the appropriate metric family for evaluating predictive distributions in regression.
    Invoked when the authors state that point metrics discard distributional information and that proper scores should be used instead.
  • domain assumption The 97 regression datasets are sufficiently diverse and representative for drawing general conclusions about tabular model performance.
    Required for the claim that ranking shifts are a general phenomenon rather than dataset-specific.

pith-pipeline@v0.9.0 · 5555 in / 1355 out tokens · 65043 ms · 2026-05-13T23:16:18.684368+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

  1. [1]

    These suites are fetched programmatically via the OpenML Python API and serve as the base layer of our collection

    OpenML benchmark suites. We include three curated regression suites hosted on OpenML [Vanschoren et al., 2014]: • Suite 297 (OpenML-CTR23): a community-curated tabular regression benchmark [Fischer et al., 2023]; • Suite 299: an additional OpenML regression collection; • Suite 269: an OpenML regression collection collating further community-vetted datasets...

  2. [2]

    PMLB provides a standardised, version-controlled archive of datasets widely used in AutoML research

    PMLB (Penn Machine Learning Benchmarks). Regression datasets from PMLB [Olson et al., 2017], distributed as compressed TSV files on GitHub. PMLB provides a standardised, version-controlled archive of datasets widely used in AutoML research

  3. [3]

    KEEL datasets add coverage of domains less represented in OpenML and PMLB, such as financial time series and energy consumption

    KEEL repository. Regression datasets from the KEEL data-mining repository [Derrac et al., 2015], distributed as zipped .dat files. KEEL datasets add coverage of domains less represented in OpenML and PMLB, such as financial time series and energy consumption

  4. [4]

    This set expands domain coverage to areas including real estate, book reviews, sensor fusion, and sports analytics

    TALENT / OpenML verified regression datasets. Additional OpenML datasets drawn from the TALENT benchmark [Liu et al., 2025] and independently verified as regression tasks. This set expands domain coverage to areas including real estate, book reviews, sensor fusion, and sports analytics.

  5. [5]

    Datasets whose normalised names are identical are treated as duplicates

    Exact normalised match. Dataset names are normalised by lowercasing, stripping leading numeric prefixes (e.g., 197_cpu_act → cpuact), removing separators and special characters, and collapsing repeated characters. Datasets whose normalised names are identical are treated as duplicates.

  6. [6]

    This catches common patterns such as houses matching californiahousing

    Substring match. For names longer than three characters, we check whether either normalised name is a substring of the other. This catches common patterns such as houses matching californiahousing

  7. [7]

    Fuzzy match. We compute a pairwise similarity ratio (via Python's SequenceMatcher) and flag pairs with ≥85% similarity as duplicates

  8. [8]

    All checks above are re-applied against these keys

    Explicit deduplication keys. Many datasets are annotated with known aliases (e.g., cpu_act ↔ cpuact). All checks above are re-applied against these keys. When a duplicate is detected, we retain only one version and discard the others, using the (arbitrary) precedence order (OpenML > PMLB > KEEL > scikit-learn).

  9. [9]

    Ensure it adheres to the expected interface for seamless integration

    Add your custom wrapper with a unique name to the folder scoringbench/wrapper/ following the template of existing wrappers. Ensure it adheres to the expected interface for seamless integration

  10. [10]

    Run the benchmarking via python run_bench_regression.py (or parallelize via slurm with sbatch --array=0-103 run_benchmark.sbatch)

  11. [11]

    Run the ranking method by executing python autorank_leaderboard.py, which will evaluate both Autorank and z-score ranking and generate the leaderboard outputs

  12. [12]

    Since the output repository is separate from the main repository, push to both

    Commit your model parquet file (documenting each run) and the updated leaderboard CSVs in /output/figures/leaderboard/. Since the output repository is separate from the main repository, push to both. This serves as a public ledger and allows traceability

  13. [13]

    Create a pull request to the ScoringBench repository for review; contributions that meet standards will be merged

  14. [14]

    Upon merge, https://scoringbench.com/ will automatically display the updated leaderboard and data is archived in the git lfs repository for reproducibility.
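Item 9 above describes adding a custom wrapper under scoringbench/wrapper/; the expected interface is set by the repository's existing templates, so the following is only a hypothetical adapter shape with invented names, illustrating the two things such a wrapper generally has to expose: a fit step and a sample-based predictive distribution that proper scoring rules can consume.

    import numpy as np

    # Hypothetical sketch only - the real interface, names, and signatures are
    # defined by the templates in scoringbench/wrapper/, not by this example.
    class MyModelWrapper:
        """Adapter from an arbitrary regression model to a benchmark-style API."""

        def __init__(self, **hyperparams):
            self.hyperparams = hyperparams
            self.model = None

        def fit(self, X_train: np.ndarray, y_train: np.ndarray) -> None:
            # Train the underlying model (TabPFN, a GBT, an MLP, ...) here.
            raise NotImplementedError

        def predict_distribution(self, X_test: np.ndarray, n_samples: int = 1000) -> np.ndarray:
            # Return an (n_test, n_samples) array of draws from the predictive
            # distribution; point metrics come from its mean, proper scoring
            # rules directly from the samples.
            raise NotImplementedError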